Renewable energy sources play an increasingly important role in the global energy mix as efforts to reduce the environmental impact of energy production intensify.
Of all the renewable alternatives, wind energy is one of the most developed technologies worldwide. The U.S. Department of Energy has put together a guide to achieving operational efficiency using predictive maintenance practices.
Predictive maintenance uses sensor information and analysis methods to measure and predict degradation and future component capability. The idea behind predictive maintenance is that failure patterns are predictable and if component failure can be predicted accurately and the component is replaced before it fails, the costs of operation and maintenance will be much lower.
The sensors fitted across different machines involved in the process of energy generation collect data related to various environmental factors (temperature, humidity, wind speed, etc.) and additional features related to various parts of the wind turbine (gearbox, tower, blades, break, etc.).
“ReneWind” is a company working on improving the machinery/processes involved in the production of wind energy using machine learning and has collected data of generator failure of wind turbines using sensors. They have shared a ciphered version of the data, as the data collected through sensors is confidential (the type of data collected varies with companies). Data has 40 predictors, 20000 observations in the training set and 5000 in the test set.
The objective is to build various classification models, tune them, and find the best one that will help identify failures so that the generators could be repaired before failing/breaking to reduce the overall maintenance cost. The nature of predictions made by the classification model will translate as follows:
It is given that the cost of repairing a generator is much less than the cost of replacing it, and the cost of inspection is less than the cost of repair.
“1” in the target variables should be considered as “failure” and “0” represents “No failure”.
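The cost ordering can be made concrete with a small sketch. The unit costs below are hypothetical placeholders (the project states only that replacement costs more than repair, which costs more than inspection), and `maintenance_cost` is an illustrative helper, not part of the project code:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical unit costs; only the ordering replacement > repair > inspection is given
COST_REPLACE = 100  # false negative: a real failure is missed and the generator breaks
COST_REPAIR = 30    # true positive: the failure is caught early and repaired
COST_INSPECT = 5    # false positive: a predicted failure turns out to be fine

def maintenance_cost(y_true, y_pred):
    """Translate a classifier's predictions into a total maintenance cost."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return tp * COST_REPAIR + fp * COST_INSPECT + fn * COST_REPLACE

y_true = np.array([1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0])
print(maintenance_cost(y_true, y_pred))  # 2*30 + 1*5 + 1*100 = 165
```

Under this cost structure the total is dominated by the false-negative term, which is why recall is the metric optimized below.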
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# To tune model, get different metric scores, and split data
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
)
from sklearn import metrics
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
# To be used for data scaling and one hot encoding
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
# To impute missing values
from sklearn.impute import SimpleImputer
# To oversample and undersample data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
# To do hyperparameter tuning
from sklearn.model_selection import RandomizedSearchCV
# To be used for creating pipelines and personalizing them
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
# To suppress scientific notations for a dataframe
pd.set_option("display.float_format", lambda x: "%.3f" % x)
# To help with model building
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
AdaBoostClassifier,
GradientBoostingClassifier,
RandomForestClassifier,
BaggingClassifier,
)
from xgboost import XGBClassifier
# To suppress warnings
import warnings
warnings.filterwarnings("ignore")
# This will help in making the Python code more structured automatically (good coding practice)
#%load_ext nb_black
Source: MT_Project_LearnerNotebook_LowCode.ipynb
from google.colab import drive
drive.mount('/content/drive')
wind=pd.read_csv('/content/drive/MyDrive/Train.csv.csv')
Mounted at /content/drive
wind.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 41 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   V1      19982 non-null  float64
 1   V2      19982 non-null  float64
 2   V3      20000 non-null  float64
 3   V4      20000 non-null  float64
 4   V5      20000 non-null  float64
 5   V6      20000 non-null  float64
 6   V7      20000 non-null  float64
 7   V8      20000 non-null  float64
 8   V9      20000 non-null  float64
 9   V10     20000 non-null  float64
 10  V11     20000 non-null  float64
 11  V12     20000 non-null  float64
 12  V13     20000 non-null  float64
 13  V14     20000 non-null  float64
 14  V15     20000 non-null  float64
 15  V16     20000 non-null  float64
 16  V17     20000 non-null  float64
 17  V18     20000 non-null  float64
 18  V19     20000 non-null  float64
 19  V20     20000 non-null  float64
 20  V21     20000 non-null  float64
 21  V22     20000 non-null  float64
 22  V23     20000 non-null  float64
 23  V24     20000 non-null  float64
 24  V25     20000 non-null  float64
 25  V26     20000 non-null  float64
 26  V27     20000 non-null  float64
 27  V28     20000 non-null  float64
 28  V29     20000 non-null  float64
 29  V30     20000 non-null  float64
 30  V31     20000 non-null  float64
 31  V32     20000 non-null  float64
 32  V33     20000 non-null  float64
 33  V34     20000 non-null  float64
 34  V35     20000 non-null  float64
 35  V36     20000 non-null  float64
 36  V37     20000 non-null  float64
 37  V38     20000 non-null  float64
 38  V39     20000 non-null  float64
 39  V40     20000 non-null  float64
 40  Target  20000 non-null  int64
dtypes: float64(40), int64(1)
memory usage: 6.3 MB
The dataset has 41 columns: 40 ciphered predictors (V1–V40) and the Target. V1 and V2 each have 18 missing values, which must be imputed before the models can run. All predictors are of type float; Target is an integer. Because all variables carry coded names, recommendations must be framed in terms of those codes alone.
wind.shape
(20000, 41)
20,000 rows and 41 columns (40 predictors plus Target).
wind.isnull().sum()
V1 18 V2 18 V3 0 V4 0 V5 0 V6 0 V7 0 V8 0 V9 0 V10 0 V11 0 V12 0 V13 0 V14 0 V15 0 V16 0 V17 0 V18 0 V19 0 V20 0 V21 0 V22 0 V23 0 V24 0 V25 0 V26 0 V27 0 V28 0 V29 0 V30 0 V31 0 V32 0 V33 0 V34 0 V35 0 V36 0 V37 0 V38 0 V39 0 V40 0 Target 0 dtype: int64
As stated earlier, V1 and V2 each have 18 missing values.
wind.head(10)
| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 | V14 | V15 | V16 | V17 | V18 | V19 | V20 | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | V29 | V30 | V31 | V32 | V33 | V34 | V35 | V36 | V37 | V38 | V39 | V40 | Target | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -4.465 | -4.679 | 3.102 | 0.506 | -0.221 | -2.033 | -2.911 | 0.051 | -1.522 | 3.762 | -5.715 | 0.736 | 0.981 | 1.418 | -3.376 | -3.047 | 0.306 | 2.914 | 2.270 | 4.395 | -2.388 | 0.646 | -1.191 | 3.133 | 0.665 | -2.511 | -0.037 | 0.726 | -3.982 | -1.073 | 1.667 | 3.060 | -1.690 | 2.846 | 2.235 | 6.667 | 0.444 | -2.369 | 2.951 | -3.480 | 0 |
| 1 | 3.366 | 3.653 | 0.910 | -1.368 | 0.332 | 2.359 | 0.733 | -4.332 | 0.566 | -0.101 | 1.914 | -0.951 | -1.255 | -2.707 | 0.193 | -4.769 | -2.205 | 0.908 | 0.757 | -5.834 | -3.065 | 1.597 | -1.757 | 1.766 | -0.267 | 3.625 | 1.500 | -0.586 | 0.783 | -0.201 | 0.025 | -1.795 | 3.033 | -2.468 | 1.895 | -2.298 | -1.731 | 5.909 | -0.386 | 0.616 | 0 |
| 2 | -3.832 | -5.824 | 0.634 | -2.419 | -1.774 | 1.017 | -2.099 | -3.173 | -2.082 | 5.393 | -0.771 | 1.107 | 1.144 | 0.943 | -3.164 | -4.248 | -4.039 | 3.689 | 3.311 | 1.059 | -2.143 | 1.650 | -1.661 | 1.680 | -0.451 | -4.551 | 3.739 | 1.134 | -2.034 | 0.841 | -1.600 | -0.257 | 0.804 | 4.086 | 2.292 | 5.361 | 0.352 | 2.940 | 3.839 | -4.309 | 0 |
| 3 | 1.618 | 1.888 | 7.046 | -1.147 | 0.083 | -1.530 | 0.207 | -2.494 | 0.345 | 2.119 | -3.053 | 0.460 | 2.705 | -0.636 | -0.454 | -3.174 | -3.404 | -1.282 | 1.582 | -1.952 | -3.517 | -1.206 | -5.628 | -1.818 | 2.124 | 5.295 | 4.748 | -2.309 | -3.963 | -6.029 | 4.949 | -3.584 | -2.577 | 1.364 | 0.623 | 5.550 | -1.527 | 0.139 | 3.101 | -1.277 | 0 |
| 4 | -0.111 | 3.872 | -3.758 | -2.983 | 3.793 | 0.545 | 0.205 | 4.849 | -1.855 | -6.220 | 1.998 | 4.724 | 0.709 | -1.989 | -2.633 | 4.184 | 2.245 | 3.734 | -6.313 | -5.380 | -0.887 | 2.062 | 9.446 | 4.490 | -3.945 | 4.582 | -8.780 | -3.383 | 5.107 | 6.788 | 2.044 | 8.266 | 6.629 | -10.069 | 1.223 | -3.230 | 1.687 | -2.164 | -3.645 | 6.510 | 0 |
| 5 | 0.160 | -4.234 | -0.264 | -5.477 | -0.191 | -0.356 | -0.134 | 4.067 | -3.859 | 1.692 | 0.138 | 3.975 | 0.673 | 1.878 | 0.764 | 4.236 | -2.129 | 2.348 | -2.147 | -0.982 | 0.386 | 1.011 | 3.419 | 0.996 | 0.061 | -3.037 | 1.788 | -1.727 | 0.308 | 1.902 | 4.666 | 3.227 | 0.629 | -1.549 | 1.322 | 5.461 | 1.109 | -3.870 | 0.274 | 2.806 | 0 |
| 6 | -0.185 | -4.721 | 0.865 | -3.079 | -2.227 | -1.282 | -0.805 | 3.290 | -1.568 | 0.750 | 0.529 | 3.221 | 2.945 | 1.724 | -0.923 | 2.535 | -1.697 | 0.677 | -0.246 | 2.748 | -1.165 | 0.248 | 1.161 | -2.850 | 0.503 | -3.532 | 1.861 | -1.465 | 0.874 | 2.418 | 0.939 | -0.545 | -0.763 | 0.816 | 1.889 | 3.624 | 1.556 | -5.433 | 0.679 | 0.465 | 0 |
| 7 | 1.735 | 1.683 | -1.269 | 4.601 | -1.417 | -2.544 | 0.132 | -0.199 | 3.094 | -1.109 | -1.662 | 0.944 | 3.481 | 0.137 | -3.473 | -4.076 | 1.727 | -1.909 | 3.569 | 2.512 | -4.579 | 3.063 | 3.686 | 0.611 | -0.430 | 0.880 | -0.994 | 1.134 | -3.768 | -0.692 | -5.244 | 1.717 | -3.839 | 1.569 | 1.795 | -4.269 | -0.516 | -0.619 | -0.831 | -4.967 | 1 |
| 8 | 1.782 | 1.315 | 4.249 | -0.518 | -0.149 | 0.033 | -1.088 | -3.118 | 0.625 | 1.567 | -0.415 | -1.401 | 2.607 | -1.024 | -2.878 | -4.524 | -4.354 | 0.107 | 1.299 | -3.596 | -5.409 | 0.633 | -3.043 | 0.965 | -0.266 | 4.671 | 1.847 | -2.321 | -1.318 | -0.682 | 3.281 | 1.611 | 2.951 | -1.862 | 4.390 | 1.371 | -2.516 | 0.770 | 0.831 | -2.311 | 0 |
| 9 | -0.894 | 4.011 | 5.252 | 3.321 | 0.727 | -4.771 | 1.031 | 3.632 | -1.391 | -1.967 | -4.779 | 6.617 | -0.148 | -2.513 | 0.734 | 0.475 | 5.085 | -2.361 | 4.561 | 2.287 | -2.307 | -0.949 | -0.301 | 2.546 | 0.738 | 4.266 | -4.145 | -0.013 | -1.469 | -2.003 | 1.680 | -0.636 | -4.449 | 2.296 | 1.575 | 1.376 | 0.597 | -1.414 | 0.544 | 0.035 | 0 |
Values range from negative to positive. The project instructions state that the negative values need not be transformed.
wind.describe()
| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 | V14 | V15 | V16 | V17 | V18 | V19 | V20 | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | V29 | V30 | V31 | V32 | V33 | V34 | V35 | V36 | V37 | V38 | V39 | V40 | Target | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 19982.000 | 19982.000 | 20000.000 | 20000.000 | 20000.000 | 20000.000 | 20000.000 | 20000.000 | 20000.000 | 20000.000 | 20000.000 | 20000.000 | 20000.000 | 20000.000 | 20000.000 | 20000.000 | 20000.000 | 20000.000 | 20000.000 | 20000.000 | 20000.000 | 20000.000 | 20000.000 | 20000.000 | 20000.000 | 20000.000 | 20000.000 | 20000.000 | 20000.000 | 20000.000 | 20000.000 | 20000.000 | 20000.000 | 20000.000 | 20000.000 | 20000.000 | 20000.000 | 20000.000 | 20000.000 | 20000.000 | 20000.000 |
| mean | -0.272 | 0.440 | 2.485 | -0.083 | -0.054 | -0.995 | -0.879 | -0.548 | -0.017 | -0.013 | -1.895 | 1.605 | 1.580 | -0.951 | -2.415 | -2.925 | -0.134 | 1.189 | 1.182 | 0.024 | -3.611 | 0.952 | -0.366 | 1.134 | -0.002 | 1.874 | -0.612 | -0.883 | -0.986 | -0.016 | 0.487 | 0.304 | 0.050 | -0.463 | 2.230 | 1.515 | 0.011 | -0.344 | 0.891 | -0.876 | 0.056 |
| std | 3.442 | 3.151 | 3.389 | 3.432 | 2.105 | 2.041 | 1.762 | 3.296 | 2.161 | 2.193 | 3.124 | 2.930 | 2.875 | 1.790 | 3.355 | 4.222 | 3.345 | 2.592 | 3.397 | 3.669 | 3.568 | 1.652 | 4.032 | 3.912 | 2.017 | 3.435 | 4.369 | 1.918 | 2.684 | 3.005 | 3.461 | 5.500 | 3.575 | 3.184 | 2.937 | 3.801 | 1.788 | 3.948 | 1.753 | 3.012 | 0.229 |
| min | -11.876 | -12.320 | -10.708 | -15.082 | -8.603 | -10.227 | -7.950 | -15.658 | -8.596 | -9.854 | -14.832 | -12.948 | -13.228 | -7.739 | -16.417 | -20.374 | -14.091 | -11.644 | -13.492 | -13.923 | -17.956 | -10.122 | -14.866 | -16.387 | -8.228 | -11.834 | -14.905 | -9.269 | -12.579 | -14.796 | -13.723 | -19.877 | -16.898 | -17.985 | -15.350 | -14.833 | -5.478 | -17.375 | -6.439 | -11.024 | 0.000 |
| 25% | -2.737 | -1.641 | 0.207 | -2.348 | -1.536 | -2.347 | -2.031 | -2.643 | -1.495 | -1.411 | -3.922 | -0.397 | -0.224 | -2.171 | -4.415 | -5.634 | -2.216 | -0.404 | -1.050 | -2.433 | -5.930 | -0.118 | -3.099 | -1.468 | -1.365 | -0.338 | -3.652 | -2.171 | -2.787 | -1.867 | -1.818 | -3.420 | -2.243 | -2.137 | 0.336 | -0.944 | -1.256 | -2.988 | -0.272 | -2.940 | 0.000 |
| 50% | -0.748 | 0.472 | 2.256 | -0.135 | -0.102 | -1.001 | -0.917 | -0.389 | -0.068 | 0.101 | -1.921 | 1.508 | 1.637 | -0.957 | -2.383 | -2.683 | -0.015 | 0.883 | 1.279 | 0.033 | -3.533 | 0.975 | -0.262 | 0.969 | 0.025 | 1.951 | -0.885 | -0.891 | -1.176 | 0.184 | 0.490 | 0.052 | -0.066 | -0.255 | 2.099 | 1.567 | -0.128 | -0.317 | 0.919 | -0.921 | 0.000 |
| 75% | 1.840 | 2.544 | 4.566 | 2.131 | 1.340 | 0.380 | 0.224 | 1.723 | 1.409 | 1.477 | 0.119 | 3.571 | 3.460 | 0.271 | -0.359 | -0.095 | 2.069 | 2.572 | 3.493 | 2.512 | -1.266 | 2.026 | 2.452 | 3.546 | 1.397 | 4.130 | 2.189 | 0.376 | 0.630 | 2.036 | 2.731 | 3.762 | 2.255 | 1.437 | 4.064 | 3.984 | 1.176 | 2.279 | 2.058 | 1.120 | 0.000 |
| max | 15.493 | 13.089 | 17.091 | 13.236 | 8.134 | 6.976 | 8.006 | 11.679 | 8.138 | 8.108 | 11.826 | 15.081 | 15.420 | 5.671 | 12.246 | 13.583 | 16.756 | 13.180 | 13.238 | 16.052 | 13.840 | 7.410 | 14.459 | 17.163 | 8.223 | 16.836 | 17.560 | 6.528 | 10.722 | 12.506 | 17.255 | 23.633 | 16.692 | 14.358 | 15.291 | 19.330 | 7.467 | 15.290 | 7.760 | 10.654 | 1.000 |
wind["Target"].value_counts(normalize=True)
0    0.945
1    0.056
Name: Target, dtype: float64
Observations: The target is highly imbalanced; only about 5.6% of the observations are failures ("1"), while roughly 94.4% are non-failures ("0").
wind.nunique()
V1 19982 V2 19982 V3 20000 V4 20000 V5 20000 V6 20000 V7 20000 V8 20000 V9 20000 V10 20000 V11 20000 V12 20000 V13 20000 V14 20000 V15 20000 V16 20000 V17 20000 V18 20000 V19 20000 V20 20000 V21 20000 V22 20000 V23 20000 V24 20000 V25 20000 V26 20000 V27 20000 V28 20000 V29 20000 V30 20000 V31 20000 V32 20000 V33 20000 V34 20000 V35 20000 V36 20000 V37 20000 V38 20000 V39 20000 V40 20000 Target 2 dtype: int64
No indication of unique identifiers that would need to be dropped from the dataset. All the values in this set with the exception of the target variable are continuous numeric values.
sns.set_style("darkgrid")
wind.hist(figsize=(20,15))
plt.show()
Source: InnHotels Learner Notebook, Full Code
num_cols=wind.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(10,10))
plt.boxplot(wind[num_cols],whis=1.5)
plt.show()
Observation: The majority of these variables contain outliers. If these outliers are not treated, it may be difficult to build a well-generalized model. However, the outliers also represent genuine sensor readings, so removing them may distort the true picture of operating conditions.
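The outliers visible in the boxplots can be quantified with the standard 1.5×IQR rule (the same `whis=1.5` used above). This sketch uses a small synthetic frame in place of `wind`, and `iqr_outlier_counts` is an illustrative helper:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for `wind`, with a few planted extreme values in V1
rng = np.random.default_rng(1)
df = pd.DataFrame({"V1": rng.normal(size=1000), "V2": rng.normal(size=1000)})
df.loc[:4, "V1"] = 15.0  # plant five extreme values

def iqr_outlier_counts(frame, whis=1.5):
    """Count values outside Q1 - whis*IQR and Q3 + whis*IQR, per column."""
    q1, q3 = frame.quantile(0.25), frame.quantile(0.75)
    iqr = q3 - q1
    return ((frame < q1 - whis * iqr) | (frame > q3 + whis * iqr)).sum()

print(iqr_outlier_counts(df))
```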
def histogram_boxplot(data, feature, figsize=(15, 10), kde=True, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (15,10))
    kde: whether to show the density curve (default True)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a triangle will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram

for feature in wind.columns:
    histogram_boxplot(wind, feature, figsize=(12, 7), kde=True, bins=None)
Source: MT Project Learner Notebook, Low Code
Observations:
plt.figure(figsize=(35,35))
sns.heatmap(data=wind[["V1","V2","V3","V4","V5","V6","V7",
"V8","V9","V10","V11","V12","V13","V14","V15",
"V16","V17","V18","V19","V20","V21","V22","V23",
"V24","V25","V26","V27","V28","V29","V30","V31","V32",
"V33","V34","V35","V36","V37","V38","V39","V40","Target"]]
.corr(),annot=True,cbar=False,cmap="Spectral")
<Axes: >
Source: Video, Intro to Python with Daniel Mitchell: 3.5, Heatmap
Correlation analysis: this analysis will focus on variable pairs that are highly correlated, i.e., those with a correlation coefficient of at least 0.70 in absolute value.
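Such pairs can also be pulled out of the correlation matrix programmatically rather than read off the heatmap. A sketch with synthetic data standing in for `wind` (V1 and V2 are built to be strongly correlated; the 0.70 threshold matches the analysis above):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for `wind`: V1 and V2 share a common signal, V3 is independent
rng = np.random.default_rng(1)
a = rng.normal(size=500)
df = pd.DataFrame({"V1": a,
                   "V2": a + 0.1 * rng.normal(size=500),
                   "V3": rng.normal(size=500)})

corr = df.corr()
# Keep every upper-triangle pair whose |correlation| is at least 0.70
pairs = [
    (corr.columns[i], corr.columns[j], corr.iloc[i, j])
    for i in range(len(corr.columns))
    for j in range(i + 1, len(corr.columns))
    if abs(corr.iloc[i, j]) >= 0.70
]
print(pairs)
```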
#Always make a copy before manipulation.
wind2=wind.copy()
X = wind2.drop("Target",axis=1)
y = wind2.pop("Target")
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=.30, random_state=1,stratify=y)
X.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 40 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   V1      19982 non-null  float64
 1   V2      19982 non-null  float64
 2   V3      20000 non-null  float64
 3   V4      20000 non-null  float64
 4   V5      20000 non-null  float64
 5   V6      20000 non-null  float64
 6   V7      20000 non-null  float64
 7   V8      20000 non-null  float64
 8   V9      20000 non-null  float64
 9   V10     20000 non-null  float64
 10  V11     20000 non-null  float64
 11  V12     20000 non-null  float64
 12  V13     20000 non-null  float64
 13  V14     20000 non-null  float64
 14  V15     20000 non-null  float64
 15  V16     20000 non-null  float64
 16  V17     20000 non-null  float64
 17  V18     20000 non-null  float64
 18  V19     20000 non-null  float64
 19  V20     20000 non-null  float64
 20  V21     20000 non-null  float64
 21  V22     20000 non-null  float64
 22  V23     20000 non-null  float64
 23  V24     20000 non-null  float64
 24  V25     20000 non-null  float64
 25  V26     20000 non-null  float64
 26  V27     20000 non-null  float64
 27  V28     20000 non-null  float64
 28  V29     20000 non-null  float64
 29  V30     20000 non-null  float64
 30  V31     20000 non-null  float64
 31  V32     20000 non-null  float64
 32  V33     20000 non-null  float64
 33  V34     20000 non-null  float64
 34  V35     20000 non-null  float64
 35  V36     20000 non-null  float64
 36  V37     20000 non-null  float64
 37  V38     20000 non-null  float64
 38  V39     20000 non-null  float64
 39  V40     20000 non-null  float64
dtypes: float64(40)
memory usage: 6.1 MB
y.info()
<class 'pandas.core.series.Series'>
RangeIndex: 20000 entries, 0 to 19999
Series name: Target
Non-Null Count  Dtype
--------------  -----
20000 non-null  int64
dtypes: int64(1)
memory usage: 156.4 KB
# Let's impute the missing values
imp_median = SimpleImputer(missing_values=np.nan, strategy="median")
# fit the imputer on the train data and transform the train data
X_train[["V1", "V2"]] = imp_median.fit_transform(X_train[["V1", "V2"]])
# transform the validation data using the imputer fitted on the train data
# (do not refit on the validation data; that would leak validation information)
X_val[["V1", "V2"]] = imp_median.transform(X_val[["V1", "V2"]])
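An alternative that rules this leakage out by construction is to put the imputer inside a Pipeline, so its statistics come only from whatever data `fit` sees. A minimal sketch with synthetic stand-in data:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in data: V1 has missing values, the label depends on V2
rng = np.random.default_rng(1)
X = pd.DataFrame({"V1": rng.normal(size=200), "V2": rng.normal(size=200)})
X.loc[:9, "V1"] = np.nan
y = (X["V2"] > 0).astype(int)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # fitted only on the data passed to fit()
    ("clf", LogisticRegression(random_state=1)),
])
pipe.fit(X_tr, y_tr)           # imputer medians come from X_tr alone
print(pipe.score(X_va, y_va))  # X_va is imputed with the training medians
```

This also keeps cross-validation honest: each CV fold re-fits the imputer on its own training portion.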
Source: Hyperparameter Tuning with Professor Rao: 1.5 Handson Oversampling and Undersampling
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("LogR", LogisticRegression(random_state=1)))
models.append(("DTree", DecisionTreeClassifier(random_state=1)))
models.append(("AdaBoost", AdaBoostClassifier(random_state=1)))
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("RF",RandomForestClassifier(random_state=1)))
models.append(("GB", GradientBoostingClassifier(random_state=1)))
results1 = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation performance on training dataset:" "\n")
for name, model in models:
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # Setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train, y=y_train, scoring=scorer, cv=kfold
    )
    results1.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean()))
print("\n" "Validation Performance:" "\n")
for name, model in models:
    model.fit(X_train, y_train)
    scores = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores))
Cross-Validation performance on training dataset:

LogR: 0.4902481389578163
DTree: 0.7078246484698097
AdaBoost: 0.6434656741108354
Bagging: 0.707808105872622
RF: 0.7194127377998345
GB: 0.7220016542597187

Validation Performance:

LogR: 0.5015015015015015
DTree: 0.7057057057057057
AdaBoost: 0.6516516516516516
Bagging: 0.7267267267267268
RF: 0.7357357357357357
GB: 0.7357357357357357
Source: MT Project LearnerNotebook Low Code
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(results1)
ax.set_xticklabels(names)
plt.show()
Source: MT Project Learner Notebook Low Code
Which metric to optimize?
Let's define a function to output different metrics (including recall) on the train and test set and a function to show confusion matrix so that we do not have to use the same code repetitively while evaluating models.
Observations: Recall compares true positives to false negatives, and it was the metric used to select the best model in this group. Failing to predict that a component will fail (a false negative) is more costly than predicting a failure that does not occur (a false positive): the company wants to repair components before they fail to prevent shutdowns in energy production, so each model should maximize recall. Of these six models, the one with the highest recall and the least overfitting is GB (Gradient Boosting), with a training recall of 72.20 and a validation recall of 73.57. Recall is sensitive to class imbalance, so oversampling and undersampling will likely increase recall in all of these models. There is very little symmetry in any of the CV score distributions.
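The metrics helper mentioned above is never actually defined in this notebook. A minimal sketch of what it might look like (the function name and exact columns are assumptions, not the project's own code), with a usage example on synthetic data:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.tree import DecisionTreeClassifier

def model_performance_classification(model, predictors, target):
    """Return accuracy, recall, precision and F1 for a fitted classifier."""
    pred = model.predict(predictors)
    return pd.DataFrame({
        "Accuracy": [accuracy_score(target, pred)],
        "Recall": [recall_score(target, pred)],
        "Precision": [precision_score(target, pred)],
        "F1": [f1_score(target, pred)],
    })

# Usage sketch on synthetic data
rng = np.random.default_rng(1)
Xs = rng.normal(size=(100, 3))
ys = (Xs[:, 0] > 0).astype(int)
clf = DecisionTreeClassifier(random_state=1).fit(Xs, ys)
print(model_performance_classification(clf, Xs, ys))
```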
Model building with oversampled data
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
print("Before OverSampling, counts of label '1': {}".format(sum(y_train == 1)))
print("Before OverSampling, counts of label '0': {} \n".format(sum(y_train == 0)))
# Synthetic Minority Over Sampling Technique
sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
print("After OverSampling, counts of label '1': {}".format(sum(y_train_over == 1)))
print("After OverSampling, counts of label '0': {} \n".format(sum(y_train_over == 0)))
print("After OverSampling, the shape of train_X: {}".format(X_train_over.shape))
print("After OverSampling, the shape of train_y: {} \n".format(y_train_over.shape))
Before OverSampling, counts of label '1': 777
Before OverSampling, counts of label '0': 13223

After OverSampling, counts of label '1': 13223
After OverSampling, counts of label '0': 13223

After OverSampling, the shape of train_X: (26446, 40)
After OverSampling, the shape of train_y: (26446,)
Source: MT Project Learner Notebook Low Code
Observation: As noted in the EDA, the split between failure and non-failure was roughly 5.5% to 94.5%. After oversampling, the classes are balanced 50/50.
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("LogRO", LogisticRegression(random_state=1)))
models.append(("BaggingO", BaggingClassifier(random_state=1)))
models.append(("DTreeO", DecisionTreeClassifier(random_state=1)))
models.append(("AdaBoostO", AdaBoostClassifier(random_state=1)))
models.append(("RFO", RandomForestClassifier(random_state=1)))
models.append(("GBO", GradientBoostingClassifier(random_state=1)))
results1 = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation performance on training dataset:" "\n")
for name, model in models:
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # Setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train_over, y=y_train_over, scoring=scorer, cv=kfold
    )
    results1.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean()))
print("\n" "Validation Performance:" "\n")
for name, model in models:
    model.fit(X_train_over, y_train_over)
    scores = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores))
Cross-Validation performance on training dataset:

LogRO: 0.8917044404851445
BaggingO: 0.975119441528989
DTreeO: 0.970128321355339
AdaBoostO: 0.904787470436327
RFO: 0.9829090368319754
GBO: 0.9329201902370526

Validation Performance:

LogRO: 0.8498498498498499
BaggingO: 0.8258258258258259
DTreeO: 0.7837837837837838
AdaBoostO: 0.8618618618618619
RFO: 0.8558558558558559
GBO: 0.8768768768768769
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(results1)
ax.set_xticklabels(names)
plt.show()
Observations: The best-fit model after oversampling is LogRO, with training/validation recall of 89.17/84.98 and a fairly symmetrical CV distribution. AdaBoostO fits slightly better, with training/validation recall of 90.47/86.18, but its distribution is less symmetrical. DTreeO has the most symmetrical distribution; however, with training/validation recall of 97.01/78.37, it is one of the most overfit models.
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
print("Before UnderSampling, counts of label '1': {}".format(sum(y_train == 1)))
print("Before UnderSampling, counts of label '0': {} \n".format(sum(y_train == 0)))
print("After UnderSampling, counts of label '1': {}".format(sum(y_train_un == 1)))
print("After UnderSampling, counts of label '0': {} \n".format(sum(y_train_un == 0)))
print("After UnderSampling, the shape of train_X: {}".format(X_train_un.shape))
print("After UnderSampling, the shape of train_y: {} \n".format(y_train_un.shape))
Before UnderSampling, counts of label '1': 777
Before UnderSampling, counts of label '0': 13223

After UnderSampling, counts of label '1': 777
After UnderSampling, counts of label '0': 777

After UnderSampling, the shape of train_X: (1554, 40)
After UnderSampling, the shape of train_y: (1554,)
Source: MT Project Learner Notebook Low Code
Observation: Just like the last round, the class imbalance before undersampling is 5.5%/94.5%. After undersampling, both classes have 777 observations.
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("LogRU", LogisticRegression(random_state=1)))
models.append(("BaggingU", BaggingClassifier(random_state=1)))
models.append(("DTreeU", DecisionTreeClassifier(random_state=1)))
models.append(("AdaBoostU", AdaBoostClassifier(random_state=1)))
models.append(("RFU", RandomForestClassifier(random_state=1)))
models.append(("GBU", GradientBoostingClassifier(random_state=1)))
results1 = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation performance on training dataset:" "\n")
for name, model in models:
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # Setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train_un, y=y_train_un, scoring=scorer, cv=kfold
    )
    results1.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean()))
print("\n" "Validation Performance:" "\n")
for name, model in models:
    model.fit(X_train_un, y_train_un)
    scores = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores))
Cross-Validation performance on training dataset:

LogRU: 0.8726220016542598
BaggingU: 0.880339123242349
DTreeU: 0.8622167080231596
AdaBoostU: 0.8725971877584782
RFU: 0.9034822167080232
GBU: 0.8932009925558313

Validation Performance:

LogRU: 0.8468468468468469
BaggingU: 0.8708708708708709
DTreeU: 0.8408408408408409
AdaBoostU: 0.8588588588588588
RFU: 0.8828828828828829
GBU: 0.8828828828828829
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(results1)
ax.set_xticklabels(names)
plt.show()
Observations: BaggingU is the closest-fit model so far, with training/validation recall of 88.03/87.08. Unfortunately, the presence of outliers skews its CV distribution. Just like the base models, none of the undersampled models show a large gap between training and validation recall.
Model comparison so far
comparison_frame1 = pd.DataFrame({'Base Model':['LogR','DTree','AdaBoost','Bagging',
'RF','GB'],
'Train_Recall':[0.49,0.70,0.64,0.70,0.71,0.72],
'Val_Recall':[0.50,0.70,0.65,0.72,0.73,0.73]})
comparison_frame2=pd.DataFrame({'Oversample':['LogRO','BaggingO','DTreeO','AdaBoostO','RFO','GBO'],
'Train Recall':[0.89,0.97,0.97,0.90,0.98,0.93],
'Val_Recall':[0.84,0.82,0.78,0.86,0.85,0.87],})
comparison_frame3=pd.DataFrame({'Undersample':['LogRU','BaggingU','DTreeU','AdaBoostU','RFU','GBU'],
                                'Train_Recall':[0.87,0.88,0.86,0.87,0.90,0.89],
                                'Val_Recall':[0.85,0.87,0.84,0.86,0.88,0.88]})
Source: EasyVisa Learner Notebook Full Code
comparison_frame1
| Base Model | Train_Recall | Val_Recall | |
|---|---|---|---|
| 0 | LogR | 0.490 | 0.500 |
| 1 | DTree | 0.700 | 0.700 |
| 2 | AdaBoost | 0.640 | 0.650 |
| 3 | Bagging | 0.700 | 0.720 |
| 4 | RF | 0.710 | 0.730 |
| 5 | GB | 0.720 | 0.730 |
comparison_frame2
| Oversample | Train Recall | Val_Recall | |
|---|---|---|---|
| 0 | LogRO | 0.890 | 0.840 |
| 1 | BaggingO | 0.970 | 0.820 |
| 2 | DTreeO | 0.970 | 0.780 |
| 3 | AdaBoostO | 0.900 | 0.860 |
| 4 | RFO | 0.980 | 0.850 |
| 5 | GBO | 0.930 | 0.870 |
comparison_frame3
| Undersample | Train_Recall | Val_Recall | |
|---|---|---|---|
| 0 | LogRU | 0.870 | 0.850 |
| 1 | BaggingU | 0.880 | 0.870 |
| 2 | DTreeU | 0.860 | 0.840 |
| 3 | AdaBoostU | 0.870 | 0.860 |
| 4 | RFU | 0.900 | 0.880 |
| 5 | GBU | 0.890 | 0.880 |
Observations: The four models with the best train/validation recall are RFU (0.90/0.88), GBU (0.89/0.88), BaggingU (0.88/0.87), and AdaBoostU (0.87/0.86).
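To scan all three rounds in one view, the comparison frames can be lined up side by side with `pd.concat`. A small sketch (abbreviated frames with only two models and the validation column each, purely for illustration):

```python
import pandas as pd

# Abbreviated frames stand in for the full comparison_frame1/2/3 above
comparison_frame1 = pd.DataFrame({"Base Model": ["LogR", "GB"], "Val_Recall": [0.50, 0.73]})
comparison_frame2 = pd.DataFrame({"Oversample": ["LogRO", "GBO"], "Val_Recall": [0.84, 0.87]})
comparison_frame3 = pd.DataFrame({"Undersample": ["LogRU", "GBU"], "Val_Recall": [0.85, 0.88]})

# axis=1 lines the frames up column-wise, one row per model rank
combined = pd.concat([comparison_frame1, comparison_frame2, comparison_frame3], axis=1)
print(combined)
```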
LogRO
# defining model
LogRO_tuned = LogisticRegression(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid = {'C':np.arange(0.1,1.1,0.1)}
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=LogRO_tuned, param_distributions=param_grid,
n_iter=10, n_jobs = -1, verbose=2,
scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over,y_train_over)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Best parameters are {'C': 0.2} with CV score=0.8920823693264202:
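Rather than copying the best parameters into a fresh model by hand (as done below), the refitted winner is also available directly as `randomized_cv.best_estimator_`. A self-contained sketch of that pattern with synthetic data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)

search = RandomizedSearchCV(
    estimator=LogisticRegression(random_state=1),
    param_distributions={"C": np.arange(0.1, 1.1, 0.1)},
    n_iter=5, scoring="recall", cv=5, random_state=1,
)
search.fit(X, y)
# With refit=True (the default) the winner is retrained on all of X,
# so it can be used directly instead of re-declaring the model by hand
print(search.best_params_)
print(search.best_estimator_.score(X, y))
```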
Sources: Easy Visa Learner Notebook Full Code and MT Learner Notebook Full Code
LogRO_tuned.get_params()
{'C': 1.0,
'class_weight': None,
'dual': False,
'fit_intercept': True,
'intercept_scaling': 1,
'l1_ratio': None,
'max_iter': 100,
'multi_class': 'auto',
'n_jobs': None,
'penalty': 'l2',
'random_state': 1,
'solver': 'lbfgs',
'tol': 0.0001,
'verbose': 0,
'warm_start': False}
Source: Practice Notebook Hyperparameter Tuning
# Set the clf to the best combination of parameters
LogRO_best = LogisticRegression(
C=0.2,
class_weight="balanced",
dual=False,
fit_intercept=True,
l1_ratio=1,
max_iter=100,
multi_class="auto",
n_jobs=None,
random_state=1,
solver='lbfgs',
tol=0.0001,
verbose=0,
warm_start=False)
# Fit the best algorithm to the data.
LogRO_best.fit(X_train_over, y_train_over)
LogisticRegression(C=0.2, class_weight='balanced', l1_ratio=1, random_state=1)
print("Accuracy on train and validation set")
print(accuracy_score(y_train_over, LogRO_best.predict(X_train_over)))
print(accuracy_score(y_val, LogRO_best.predict(X_val)))
print("Recall on train and validation set")
print(recall_score(y_train_over, LogRO_best.predict(X_train_over)))
print(recall_score(y_val, LogRO_best.predict(X_val)))
print("Precision on train and validation set")
print(precision_score(y_train_over, LogRO_best.predict(X_train_over)))
print(precision_score(y_val, LogRO_best.predict(X_val)))
print("F1 on train and validation set")
print(f1_score(y_train_over, LogRO_best.predict(X_train_over)))
print(f1_score(y_val, LogRO_best.predict(X_val)))
print("")
Accuracy on train and validation set
0.8856159721697043
0.8678333333333333
Recall on train and validation set
0.8919307267639719
0.8498498498498499
Precision on train and validation set
0.8808065720687079
0.27582846003898637
F1 on train and validation set
0.8863337466651636
0.4164827078734364
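The same four-metric block is printed for every model in this section; a small helper would avoid the duplication (a sketch; `report_scores` is my own illustrative name, not from the notebook):

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def report_scores(model, X_train, y_train, X_val, y_val):
    """Print accuracy, recall, precision and F1 for train and validation sets."""
    pred_train, pred_val = model.predict(X_train), model.predict(X_val)
    for name, metric in [("Accuracy", accuracy_score), ("Recall", recall_score),
                         ("Precision", precision_score), ("F1", f1_score)]:
        print(name, "on train and validation set")
        print(metric(y_train, pred_train))
        print(metric(y_val, pred_val))
```

Computing the predictions once per set also avoids the eight redundant `model.predict` calls the repeated blocks make.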
# defining model
DTreeU_tuned = DecisionTreeClassifier(random_state=1)
# Parameter grid to pass in RandomizedSearchCV
param_grid = {'max_depth': np.arange(2,6),
'min_samples_leaf': [1, 4, 7],
'max_leaf_nodes' : [10,15],
'min_impurity_decrease': [0.0001,0.001] }
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=DTreeU_tuned, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'min_samples_leaf': 1, 'min_impurity_decrease': 0.0001, 'max_leaf_nodes': 10, 'max_depth': 5} with CV score=0.8506368899917287:
Sources: Easy Visa Project Learner Notebook Full Code and MT Project Learner Notebook Full Code
DTreeU_tuned.get_params()
{'ccp_alpha': 0.0,
'class_weight': None,
'criterion': 'gini',
'max_depth': None,
'max_features': None,
'max_leaf_nodes': None,
'min_impurity_decrease': 0.0,
'min_samples_leaf': 1,
'min_samples_split': 2,
'min_weight_fraction_leaf': 0.0,
'random_state': 1,
'splitter': 'best'}
DTreeU_best = DTreeU_tuned
# Fit the best algorithm to the data.
DTreeU_best.fit(X_train_un, y_train_un)
DecisionTreeClassifier(random_state=1)
Note: I had to fit this a second time in order to get a recall score; when I ran "DTreeU_tuned" through, Python said it hadn't been fitted yet, because RandomizedSearchCV fits clones of the estimator rather than the estimator object itself.
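The not-fitted error mentioned above happens because RandomizedSearchCV fits clones of the estimator it is given; the fitted winner (with the tuned parameters, refit on the full training set) can be retrieved directly as `best_estimator_`. A toy sketch with illustrative data:

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(1)
X_toy, y_toy = rng.rand(100, 4), rng.randint(0, 2, 100)

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_distributions={"max_depth": [2, 3, 4]},
    n_iter=3, cv=3, random_state=1,
)
search.fit(X_toy, y_toy)

# best_estimator_ is the winning clone, already refit on all of (X_toy, y_toy),
# so no manual rebuild-and-refit step is needed
best = search.best_estimator_
```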
print("Accuracy on train and validation set")
print(accuracy_score(y_train_over, DTreeU_best.predict(X_train_over)))
print(accuracy_score(y_val, DTreeU_best.predict(X_val)))
print("Recall on train and validation set")
print(recall_score(y_train_over, DTreeU_best.predict(X_train_over)))
print(recall_score(y_val, DTreeU_best.predict(X_val)))
print("Precision on train and validation set")
print(precision_score(y_train_over, DTreeU_best.predict(X_train_over)))
print(precision_score(y_val, DTreeU_best.predict(X_val)))
print("F1 on train and validation set")
print(f1_score(y_train_over, DTreeU_best.predict(X_train_over)))
print(f1_score(y_val, DTreeU_best.predict(X_val)))
print("")
Accuracy on train and validation set
0.890607275202299
0.8315
Recall on train and validation set
0.9316342736141572
0.8408408408408409
Precision on train and validation set
0.8609868604976237
0.22617124394184168
F1 on train and validation set
0.8949184555591879
0.35646085295989816
AdaBoostU
Source: Easy Visa Project Learner Notebook Full Code
AdaBoostClassifier().get_params()
{'algorithm': 'SAMME.R',
'base_estimator': 'deprecated',
'estimator': None,
'learning_rate': 1.0,
'n_estimators': 50,
'random_state': None}
# defining model
AdaBoostU_tuned = AdaBoostClassifier(random_state=1)
# Parameter grid to pass in RandomizedSearchCV
param_grid={'learning_rate': [0.001,0.01,0.1,1.0],
'n_estimators': [50,100,150,200],
}
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=AdaBoostU_tuned, param_distributions=param_grid,
n_iter=10, n_jobs = -1, verbose=2,
scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Best parameters are {'n_estimators': 100, 'learning_rate': 1.0} with CV score=0.8854259718775849:
Sources: Easy Visa Project Learner Notebook Full Code
# Creating a new AdaBoost model with the best parameters, using a constrained decision tree as the weak learner
AdaBoostU_tuned_best = AdaBoostClassifier(n_estimators=100, learning_rate=1.0, base_estimator= DecisionTreeClassifier(min_samples_leaf=1,
min_impurity_decrease=0.001, max_leaf_nodes=10,max_depth=5,random_state=1))
AdaBoostU_tuned_best.fit(X_train_un,y_train_un)
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=5,
                                                         max_leaf_nodes=10,
                                                         min_impurity_decrease=0.001,
                                                         random_state=1),
                   n_estimators=100)
AdaBoost_best = AdaBoostU_tuned_best
AdaBoost_best.fit(X_train_un, y_train_un)
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=5,
                                                         max_leaf_nodes=10,
                                                         min_impurity_decrease=0.001,
                                                         random_state=1),
                   n_estimators=100)
NOTE: I had to fit this twice to get a recall score because the first time I got an error saying the model had not been fitted.
print("Accuracy on train and validation set")
print(accuracy_score(y_train_over, AdaBoost_best.predict(X_train_over)))
print(accuracy_score(y_val, AdaBoost_best.predict(X_val)))
print("Recall on train and validation set")
print(recall_score(y_train_over, AdaBoost_best.predict(X_train_over)))
print(recall_score(y_val, AdaBoost_best.predict(X_val)))
print("Precision on train and validation set")
print(precision_score(y_train_over, AdaBoost_best.predict(X_train_over)))
print(precision_score(y_val, AdaBoost_best.predict(X_val)))
print("F1 on train and validation set")
print(f1_score(y_train_over, AdaBoost_best.predict(X_train_over)))
print(f1_score(y_val, AdaBoost_best.predict(X_val)))
print("")
Accuracy on train and validation set
0.9537548211449747
0.9345
Recall on train and validation set
0.9708084398396732
0.8738738738738738
Precision on train and validation set
0.9387889425186485
0.4532710280373832
F1 on train and validation set
0.9545302450087371
0.596923076923077
GBU (Gradient Boosting with undersampling)
# defining model
GBU_tuned = GradientBoostingClassifier(random_state=1)
# Parameter grid to pass in RandomizedSearchCV
param_grid={"n_estimators":np.arange(100,150,25),
"learning_rate":[0.2,0.05,1.0],
}
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=GBU_tuned, param_distributions=param_grid,
n_iter=10, n_jobs = -1, verbose=2,
scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Fitting 5 folds for each of 6 candidates, totalling 30 fits
Best parameters are {'n_estimators': 100, 'learning_rate': 0.2} with CV score=0.902191894127378:
Source: Easy Visa Project Learner Notebook Full Code and MT Project Learner Notebook Full Code
GBU_best = GradientBoostingClassifier(
n_estimators=100,
learning_rate=0.2,
random_state=1)
# Fit the best algorithm to the data.
GBU_best.fit(X_train_over, y_train_over)
GradientBoostingClassifier(learning_rate=0.2, random_state=1)
print("Accuracy on train and validation set")
print(accuracy_score(y_train_over, GBU_best.predict(X_train_over)))
print(accuracy_score(y_val, GBU_best.predict(X_val)))
print("Recall on train and validation set")
print(recall_score(y_train_over, GBU_best.predict(X_train_over)))
print(recall_score(y_val, GBU_best.predict(X_val)))
print("Precision on train and validation set")
print(precision_score(y_train_over, GBU_best.predict(X_train_over)))
print(precision_score(y_val, GBU_best.predict(X_val)))
print("F1 on train and validation set")
print(f1_score(y_train_over, GBU_best.predict(X_train_over)))
print(f1_score(y_val, GBU_best.predict(X_val)))
print("")
Accuracy on train and validation set
0.9709218785449596
0.9708333333333333
Recall on train and validation set
0.9553807759207441
0.8678678678678678
Precision on train and validation set
0.9860287230721199
0.6880952380952381
F1 on train and validation set
0.9704628384866525
0.7675962815405046
Hyperparameter tuning can take a long time to run, so to keep that in check you can use the following grids, wherever required.
# Gradient Boosting
param_grid = { "n_estimators": np.arange(100,150,25), "learning_rate": [0.2, 0.05, 1], "subsample":[0.5,0.7], "max_features":[0.5,0.7] }
# AdaBoost
param_grid = { "n_estimators": [100, 150, 200], "learning_rate": [0.2, 0.05], "base_estimator": [DecisionTreeClassifier(max_depth=1, random_state=1), DecisionTreeClassifier(max_depth=2, random_state=1), DecisionTreeClassifier(max_depth=3, random_state=1), ] }
# Bagging
param_grid = { 'max_samples': [0.8,0.9,1], 'max_features': [0.7,0.8,0.9], 'n_estimators' : [30,50,70], }
# Random Forest
param_grid = { "n_estimators": [200,250,300], "min_samples_leaf": np.arange(1, 4), "max_features": [np.arange(0.3, 0.6, 0.1),'sqrt'], "max_samples": np.arange(0.4, 0.7, 0.1) }
# Decision Tree
param_grid = { 'max_depth': np.arange(2,6), 'min_samples_leaf': [1, 4, 7], 'max_leaf_nodes' : [10, 15], 'min_impurity_decrease': [0.0001,0.001] }
# Logistic Regression
param_grid = {'C': np.arange(0.1,1.1,0.1)}
# XGBoost
param_grid={ 'n_estimators': [150, 200, 250], 'scale_pos_weight': [5,10], 'learning_rate': [0.1,0.2], 'gamma': [0,3,5], 'subsample': [0.8,0.9] }
comparison_frame = pd.DataFrame({'Best Models':['LogRO_best','DTreeU_best','AdaBoost_best','GBU_best'],
'Train_Recall':[0.89,0.93,0.97,0.95],
'Val_Recall':[0.84,0.84,0.84,0.86]})
comparison_frame
|  | Best Models | Train_Recall | Val_Recall |
|---|---|---|---|
| 0 | LogRO_best | 0.890 | 0.840 |
| 1 | DTreeU_best | 0.930 | 0.840 |
| 2 | AdaBoost_best | 0.970 | 0.840 |
| 3 | GBU_best | 0.950 | 0.860 |
The model with the best-fitting recall scores is LogRO_best, Logistic Regression with oversampling, with a train/validation recall of 0.89/0.84; even so, the gap between the two shows some overfitting. The introduction to this project suggested that some variables represent weather factors, which influence one another, and the correlational analysis also found several variables with high correlation coefficients. This suggests a degree of collinearity, which may be pulling down all of the scores.
Check for collinearity.
import statsmodels.stats.api as sms
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
from statsmodels.tools.tools import add_constant
# let's check the VIF of the predictors
vif_series = pd.Series(
[variance_inflation_factor(X_train_over.values, i) for i in range(X_train_over.shape[1])],
index=X_train_over.columns,
dtype=float,
)
print("VIF values: \n\n{}\n".format(vif_series))
VIF values: V1 1102.325 V2 1671.602 V3 12885835843692.406 V4 22239998159854.301 V5 5792411096296.458 V6 12475345228173.119 V7 16773182969722.518 V8 4187447352273.822 V9 31493703687905.566 V10 14274483763456.406 V11 19453994070714.887 V12 21756519938987.902 V13 9748051141494.580 V14 16376725917710.895 V15 36173490982895.547 V16 6757088713234.053 V17 6424535845036.371 V18 14622076712241.871 V19 15189206163138.266 V20 8481355230452.911 V21 10891413850956.459 V22 8578285004515.230 V23 76984609014880.281 V24 16199998659606.102 V25 5345518845543.615 V26 21548323575935.387 V27 18307315558416.652 V28 15037060525444.061 V29 17455812509187.969 V30 9843933611738.789 V31 7718251289409.591 V32 12025633183899.855 V33 10341216136327.201 V34 14297141674192.051 V35 23765697242060.664 V36 18196362130789.883 V37 16742006049704.445 V38 42891425022576.156 V39 26260056136271.113 V40 15502924706955.236 dtype: float64
Every single variable is far above 10, which indicates a high degree of collinearity. Beginning with V23, which has the highest VIF, I can try to reduce collinearity and simplify the model in the hopes of improving the scores. This must be done one variable at a time.
Source: InnHotels Project Learner Notebook Full Code
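The one-variable-at-a-time elimination can also be automated with a loop that repeatedly drops the highest-VIF column. A sketch (`drop_high_vif` is my own illustrative helper; the threshold of 10 is the usual rule of thumb):

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def drop_high_vif(X, threshold=10.0):
    """Repeatedly drop the column with the highest VIF until every
    remaining column's VIF is at or below the threshold."""
    X = X.copy()
    while X.shape[1] > 1:
        vifs = pd.Series(
            [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
            index=X.columns,
        )
        worst = vifs.idxmax()
        if vifs[worst] <= threshold:
            break
        X = X.drop(columns=worst)
    return X
```

Recomputing all VIFs after every drop matters, because removing one column changes the VIFs of every other column, as the manual steps below demonstrate.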
X_train1 = X_train_over.drop("V23", axis=1)
vif_series = pd.Series(
[variance_inflation_factor(X_train1.values, i) for i in range(X_train1.shape[1])],
index=X_train1.columns,
dtype=float,
)
print("VIF values: \n\n{}\n".format(vif_series))
VIF values: V1 1102.324 V2 1671.383 V3 26727594227718.078 V4 11682489305760.041 V5 19971616972818.164 V6 15746851843952.783 V7 13964650007350.375 V8 21193410011155.273 V9 30124412223214.020 V10 17388415549693.035 V11 23395322739586.992 V12 27886065804151.680 V13 36028797018963.969 V14 20332278227406.301 V15 14365549050623.592 V16 5317118804451.589 V17 450359962737049.625 V18 8652448851816.515 V19 14790146559509.018 V20 48687563539140.500 V21 10388926476056.508 V22 10735636775615.008 V24 11820471462914.688 V25 10293942005418.277 V26 12978673277724.771 V27 3885763267791.627 V28 15087435937589.602 V29 19123565296690.004 V30 15556475396789.277 V31 23334713095183.918 V32 11161337366469.631 V33 28685347945035.008 V34 13168419963071.625 V35 9919822967776.424 V36 19164253733491.473 V37 9612806034942.361 V38 37374270766560.133 V39 16930825666806.375 V40 8619329430374.155 dtype: float64
Source: InnHotels Project Learner Notebook, Full Code
Dropping V23 significantly increased the VIF of V17. I will try dropping V17 to determine whether I can further reduce collinearity. If values continue to increase, this may not be a workable option.
X_train2 = X_train1.drop("V17", axis=1)
vif_series = pd.Series(
[variance_inflation_factor(X_train2.values, i) for i in range(X_train2.shape[1])],
index=X_train2.columns,
dtype=float,
)
print("VIF values: \n\n{}\n".format(vif_series))
VIF values: V1 1102.320 V2 1671.371 V3 9726997035357.443 V4 22860911814063.430 V5 6194772527332.182 V6 17695872799098.215 V7 15583389714084.762 V8 14388497212046.312 V9 10917817278473.930 V10 36764078590779.562 V11 10609186401343.924 V12 14504346625991.936 V13 15885712971324.500 V14 9571943947652.488 V15 27629445566690.160 V16 8378790004410.226 V18 11945887605757.283 V19 19496102282989.160 V20 5285915055599.174 V21 8085457140701.070 V22 21862134113449.008 V24 37219831631161.125 V25 39505259889214.875 V26 21094143453725.977 V27 13207037030412.012 V28 8355472406995.354 V29 25882756479140.781 V30 22350370359158.789 V31 31493703687905.566 V32 39854863959030.938 V33 30741294384781.543 V34 10761289432187.564 V35 8136584692629.622 V36 15087435937589.602 V37 29825163095168.848 V38 26569909306020.625 V39 18123137333482.883 V40 7376903566536.439 dtype: float64
Eliminating the highest-VIF variable is only increasing the remaining VIF values. A second option is to look at the summary of a logistic regression, eliminate the variables with the highest p-values, and watch their impact on the pseudo R-squared.
# fitting the model on training set
logit = sm.Logit(y_train_over, X_train_over.astype(float))
lg = logit.fit()
Warning: Maximum number of iterations has been exceeded.
Current function value: 0.333100
Iterations: 35
/usr/local/lib/python3.10/dist-packages/statsmodels/base/model.py:604: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
warnings.warn("Maximum Likelihood optimization failed to "
print(lg.summary())
Logit Regression Results
==============================================================================
Dep. Variable: Target No. Observations: 26446
Model: Logit Df Residuals: 26406
Method: MLE Df Model: 39
Date: Wed, 19 Jul 2023 Pseudo R-squ.: 0.5194
Time: 16:24:49 Log-Likelihood: -8809.2
converged: False LL-Null: -18331.
Covariance Type: nonrobust LLR p-value: 0.000
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
V1 0.2400 0.191 1.254 0.210 -0.135 0.615
V2 0.2652 0.450 0.590 0.555 -0.616 1.147
V3 0.6369 1.71e+05 3.72e-06 1.000 -3.36e+05 3.36e+05
V4 1.1091 nan nan nan nan nan
V5 -0.2568 nan nan nan nan nan
V6 0.0996 5.11e+05 1.95e-07 1.000 -1e+06 1e+06
V7 -0.1407 nan nan nan nan nan
V8 0.4508 1.62e+05 2.78e-06 1.000 -3.18e+05 3.18e+05
V9 0.1921 1.88e+05 1.02e-06 1.000 -3.69e+05 3.69e+05
V10 0.3600 nan nan nan nan nan
V11 0.7280 nan nan nan nan nan
V12 -0.8945 1.18e+05 -7.57e-06 1.000 -2.32e+05 2.32e+05
V13 0.2017 2.37e+05 8.53e-07 1.000 -4.64e+05 4.64e+05
V14 0.4546 nan nan nan nan nan
V15 -0.5897 nan nan nan nan nan
V16 0.6805 nan nan nan nan nan
V17 0.0037 nan nan nan nan nan
V18 0.5933 1.63e+04 3.64e-05 1.000 -3.2e+04 3.2e+04
V19 0.8416 nan nan nan nan nan
V20 -0.4028 nan nan nan nan nan
V21 0.3022 1.02e+05 2.97e-06 1.000 -1.99e+05 1.99e+05
V22 0.1903 nan nan nan nan nan
V23 0.7050 9.97e+04 7.07e-06 1.000 -1.95e+05 1.95e+05
V24 -0.3336 1.62e+05 -2.06e-06 1.000 -3.17e+05 3.17e+05
V25 0.8838 nan nan nan nan nan
V26 -0.4792 nan nan nan nan nan
V27 -0.3292 nan nan nan nan nan
V28 -0.6515 nan nan nan nan nan
V29 0.0135 nan nan nan nan nan
V30 0.1775 5.08e+04 3.49e-06 1.000 -9.96e+04 9.96e+04
V31 0.1465 8.55e+04 1.71e-06 1.000 -1.67e+05 1.67e+05
V32 -0.0396 nan nan nan nan nan
V33 -0.5451 1.17e+05 -4.66e-06 1.000 -2.29e+05 2.29e+05
V34 -0.1400 nan nan nan nan nan
V35 0.0533 3.66e+04 1.46e-06 1.000 -7.16e+04 7.16e+04
V36 0.2298 nan nan nan nan nan
V37 -0.0106 nan nan nan nan nan
V38 0.8679 nan nan nan nan nan
V39 -0.0616 3.62e+05 -1.7e-07 1.000 -7.1e+05 7.1e+05
V40 0.4523 1.69e+05 2.67e-06 1.000 -3.32e+05 3.32e+05
==============================================================================
Several of the p-values exceed 0.05. Several others are nan, probably because the extreme multicollinearity makes the covariance matrix nearly singular, so their standard errors cannot be computed. I can start with V3, the first variable with a p-value of 1.0, and see how eliminating it changes the model.
Source: InnHotels Project Learner Notebook Full Code
X_train2=X_train_over.drop(["V3"],axis=1)
LogR2=sm.Logit(y_train_over,X_train2.astype(float))
lg2=LogR2.fit()
Warning: Maximum number of iterations has been exceeded.
Current function value: 0.333100
Iterations: 35
/usr/local/lib/python3.10/dist-packages/statsmodels/base/model.py:604: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
warnings.warn("Maximum Likelihood optimization failed to "
print(lg2.summary())
Logit Regression Results
==============================================================================
Dep. Variable: Target No. Observations: 26446
Model: Logit Df Residuals: 26407
Method: MLE Df Model: 38
Date: Wed, 19 Jul 2023 Pseudo R-squ.: 0.5194
Time: 16:41:49 Log-Likelihood: -8809.2
converged: False LL-Null: -18331.
Covariance Type: nonrobust LLR p-value: 0.000
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
V1 0.2400 0.191 1.254 0.210 -0.135 0.615
V2 0.2652 0.450 0.590 0.555 -0.616 1.147
V4 1.1246 nan nan nan nan nan
V5 -0.2648 nan nan nan nan nan
V6 0.0382 6.73e+05 5.68e-08 1.000 -1.32e+06 1.32e+06
V7 -0.1680 nan nan nan nan nan
V8 0.4570 5.27e+05 8.68e-07 1.000 -1.03e+06 1.03e+06
V9 0.1667 2.15e+06 7.76e-08 1.000 -4.21e+06 4.21e+06
V10 0.3564 2.32e+06 1.54e-07 1.000 -4.54e+06 4.54e+06
V11 0.6486 1.78e+05 3.65e-06 1.000 -3.48e+05 3.48e+05
V12 -0.8526 nan nan nan nan nan
V13 0.2020 7.23e+05 2.79e-07 1.000 -1.42e+06 1.42e+06
V14 0.3874 1.39e+06 2.78e-07 1.000 -2.73e+06 2.73e+06
V15 -0.5833 nan nan nan nan nan
V16 0.6358 7.41e+05 8.59e-07 1.000 -1.45e+06 1.45e+06
V17 0.0171 8.06e+05 2.12e-08 1.000 -1.58e+06 1.58e+06
V18 0.5636 1.41e+06 4e-07 1.000 -2.76e+06 2.76e+06
V19 0.9157 2.84e+05 3.23e-06 1.000 -5.56e+05 5.56e+05
V20 -0.3792 8.93e+05 -4.25e-07 1.000 -1.75e+06 1.75e+06
V21 0.2055 1.85e+05 1.11e-06 1.000 -3.62e+05 3.62e+05
V22 0.1185 3.49e+06 3.4e-08 1.000 -6.84e+06 6.84e+06
V23 0.5200 8.02e+05 6.48e-07 1.000 -1.57e+06 1.57e+06
V24 -0.3434 nan nan nan nan nan
V25 0.9289 nan nan nan nan nan
V26 -0.3603 9.88e+05 -3.65e-07 1.000 -1.94e+06 1.94e+06
V27 -0.3540 2.95e+05 -1.2e-06 1.000 -5.78e+05 5.78e+05
V28 -0.7054 5.38e+05 -1.31e-06 1.000 -1.05e+06 1.05e+06
V29 0.0416 nan nan nan nan nan
V30 0.1516 4e+05 3.79e-07 1.000 -7.84e+05 7.84e+05
V31 0.2821 nan nan nan nan nan
V32 -0.1214 9.78e+04 -1.24e-06 1.000 -1.92e+05 1.92e+05
V33 -0.5342 2.68e+05 -2e-06 1.000 -5.25e+05 5.25e+05
V34 -0.1276 3.39e+05 -3.76e-07 1.000 -6.65e+05 6.65e+05
V35 0.1592 1.22e+06 1.3e-07 1.000 -2.4e+06 2.4e+06
V36 0.3297 1.74e+06 1.89e-07 1.000 -3.42e+06 3.42e+06
V37 -0.0310 6.29e+05 -4.92e-08 1.000 -1.23e+06 1.23e+06
V38 0.8094 3.39e+05 2.39e-06 1.000 -6.64e+05 6.64e+05
V39 -0.0271 6.24e+05 -4.34e-08 1.000 -1.22e+06 1.22e+06
V40 0.4719 2.52e+05 1.87e-06 1.000 -4.95e+05 4.95e+05
==============================================================================
I am going to drop all the variables with a p-value of 1.0 at once just to see what happens. If a problem arises, I can always go back and drop them one at a time.
X_train3=X_train_over.drop(["V6","V8","V9","V10",
"V11","V13","V14","V16","V17","V18",
"V19","V20","V21","V22","V23","V26",
"V27","V28","V32","V33",
"V4","V35","V36","V37","V38","V39","V40"],axis=1)
LogR3=sm.Logit(y_train_over,X_train3.astype(float))
lg3=LogR3.fit()
Warning: Maximum number of iterations has been exceeded.
Current function value: 0.333100
Iterations: 35
/usr/local/lib/python3.10/dist-packages/statsmodels/base/model.py:604: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
warnings.warn("Maximum Likelihood optimization failed to "
print(lg3.summary())
Logit Regression Results
==============================================================================
Dep. Variable: Target No. Observations: 26446
Model: Logit Df Residuals: 26433
Method: MLE Df Model: 12
Date: Wed, 19 Jul 2023 Pseudo R-squ.: 0.5194
Time: 16:52:34 Log-Likelihood: -8809.2
converged: False LL-Null: -18331.
Covariance Type: nonrobust LLR p-value: 0.000
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
V1 0.2400 0.191 1.254 0.210 -0.135 0.615
V2 0.2652 0.450 0.590 0.555 -0.616 1.147
V3 0.2764 4.95e+04 5.58e-06 1.000 -9.71e+04 9.71e+04
V5 1.4130 8.05e+04 1.76e-05 1.000 -1.58e+05 1.58e+05
V7 0.9763 2.98e+05 3.28e-06 1.000 -5.83e+05 5.83e+05
V12 -0.6240 5.49e+04 -1.14e-05 1.000 -1.08e+05 1.08e+05
V15 0.7803 2.16e+05 3.61e-06 1.000 -4.24e+05 4.24e+05
V24 -0.7804 4.7e+04 -1.66e-05 1.000 -9.2e+04 9.2e+04
V25 0.0505 2.82e+05 1.79e-07 1.000 -5.53e+05 5.53e+05
V29 -1.9782 8.41e+04 -2.35e-05 1.000 -1.65e+05 1.65e+05
V30 2.5336 8.17e+04 3.1e-05 1.000 -1.6e+05 1.6e+05
V31 -0.0831 7.49e+04 -1.11e-06 1.000 -1.47e+05 1.47e+05
V34 0.5482 1.09e+05 5.03e-06 1.000 -2.14e+05 2.14e+05
==============================================================================
All the nan entries have now been replaced with p-values of 1.0. I am going to recheck the VIF.
vif_series = pd.Series(
[variance_inflation_factor(X_train3.values, i) for i in range(X_train3.shape[1])],
index=X_train3.columns,
dtype=float,
)
print("VIF values: \n\n{}\n".format(vif_series))
VIF values:

V1    1101.467
V2    1669.314
V3         inf
V5         inf
V7         inf
V12        inf
V15        inf
V24        inf
V25        inf
V29        inf
V30        inf
V31        inf
V34        inf
dtype: float64
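An infinite VIF means some of the remaining columns are exact linear combinations of the others. A quick way to confirm perfect collinearity in any design matrix is to compare its rank to its column count; a toy sketch (illustrative data, not the project's):

```python
import numpy as np

rng = np.random.RandomState(1)
x1, x2 = rng.rand(50), rng.rand(50)
# third column is an exact linear combination of the first two
X_demo = np.column_stack([x1, x2, x1 + x2])

# a rank below the column count confirms perfect collinearity,
# which is exactly what an infinite VIF is reporting
rank = np.linalg.matrix_rank(X_demo)
```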
X_train4=X_train3.drop(["V30","V31","V34"],axis=1)
LogR4=sm.Logit(y_train_over,X_train4.astype(float))
lg4=LogR4.fit()
Optimization terminated successfully.
Current function value: 0.333276
Iterations 7
print(lg4.summary())
Logit Regression Results
==============================================================================
Dep. Variable: Target No. Observations: 26446
Model: Logit Df Residuals: 26436
Method: MLE Df Model: 9
Date: Wed, 19 Jul 2023 Pseudo R-squ.: 0.5192
Time: 17:06:48 Log-Likelihood: -8813.8
converged: True LL-Null: -18331.
Covariance Type: nonrobust LLR p-value: 0.000
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
V1 0.6155 0.020 30.575 0.000 0.576 0.655
V2 1.3846 0.035 39.812 0.000 1.316 1.453
V3 -0.8026 0.016 -51.343 0.000 -0.833 -0.772
V5 -1.0030 0.019 -52.631 0.000 -1.040 -0.966
V7 -5.9753 0.163 -36.646 0.000 -6.295 -5.656
V12 0.7081 0.027 26.267 0.000 0.655 0.761
V15 3.1218 0.071 43.706 0.000 2.982 3.262
V24 -1.1787 0.033 -35.321 0.000 -1.244 -1.113
V25 -3.4226 0.077 -44.207 0.000 -3.574 -3.271
V29 -1.6468 0.039 -42.606 0.000 -1.723 -1.571
==============================================================================
vif_series = pd.Series(
[variance_inflation_factor(X_train4.values, i) for i in range(X_train4.shape[1])],
index=X_train4.columns,
dtype=float,
)
print("VIF values: \n\n{}\n".format(vif_series))
VIF values:

V1     10.528
V2     25.748
V3      6.866
V5      3.072
V7    277.548
V12    24.611
V15   168.591
V24    50.735
V25    54.654
V29    25.789
dtype: float64
After experimenting with eliminating one variable at a time (V5, V7, V12, V15, V24, V25, V29, V30, V31 and V34), I found that I got the best model performance (lowest p-values and highest pseudo R-squared) by eliminating V30, V31 and V34. Now I no longer have infinite VIF values, and I will try to further tune the model by eliminating the variables with the highest VIFs one at a time. Unfortunately, I still have a pretty low pseudo R-squared.
X_train5 = X_train4.drop("V7", axis=1)
vif_series = pd.Series(
[variance_inflation_factor(X_train5.values, i) for i in range(X_train5.shape[1])],
index=X_train5.columns,
dtype=float,
)
print("VIF values: \n\n{}\n".format(vif_series))
VIF values:

V1     3.313
V2     2.275
V3     2.395
V5     3.065
V12    1.379
V15    3.339
V24    2.509
V25    5.992
V29    2.675
dtype: float64
Now all VIFs are less than 10. I will check my final model's scores and then bring it into production.
LogR5=sm.Logit(y_train_over,X_train5.astype(float))
lg5=LogR5.fit()
Optimization terminated successfully.
Current function value: 0.361983
Iterations 7
print(lg5.summary())
Logit Regression Results
==============================================================================
Dep. Variable: Target No. Observations: 26446
Model: Logit Df Residuals: 26437
Method: MLE Df Model: 8
Date: Wed, 19 Jul 2023 Pseudo R-squ.: 0.4778
Time: 17:17:22 Log-Likelihood: -9573.0
converged: True LL-Null: -18331.
Covariance Type: nonrobust LLR p-value: 0.000
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
V1 -0.0044 0.010 -0.425 0.671 -0.025 0.016
V2 0.1675 0.009 18.489 0.000 0.150 0.185
V3 -0.3670 0.009 -42.501 0.000 -0.384 -0.350
V5 -1.0031 0.019 -53.791 0.000 -1.040 -0.967
V12 -0.2702 0.007 -41.260 0.000 -0.283 -0.257
V15 0.5885 0.011 51.514 0.000 0.566 0.611
V24 0.0233 0.006 3.827 0.000 0.011 0.035
V25 -0.7971 0.023 -34.495 0.000 -0.842 -0.752
V29 -0.3277 0.012 -27.263 0.000 -0.351 -0.304
==============================================================================
Now the p-value of V1 exceeds 0.05 and the pseudo R-squared has dropped somewhat. I will make one final model tweak to see what happens when I drop V1.
X_train6=X_train5.drop(["V1"],axis=1)
LogR6=sm.Logit(y_train_over,X_train6.astype(float))
lg6=LogR6.fit()
Optimization terminated successfully.
Current function value: 0.361987
Iterations 7
print(lg6.summary())
Logit Regression Results
==============================================================================
Dep. Variable: Target No. Observations: 26446
Model: Logit Df Residuals: 26438
Method: MLE Df Model: 7
Date: Wed, 19 Jul 2023 Pseudo R-squ.: 0.4778
Time: 17:44:53 Log-Likelihood: -9573.1
converged: True LL-Null: -18331.
Covariance Type: nonrobust LLR p-value: 0.000
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
V2 0.1653 0.007 22.226 0.000 0.151 0.180
V3 -0.3669 0.009 -42.529 0.000 -0.384 -0.350
V5 -0.9989 0.016 -63.406 0.000 -1.030 -0.968
V12 -0.2697 0.006 -41.790 0.000 -0.282 -0.257
V15 0.5880 0.011 51.806 0.000 0.566 0.610
V24 0.0231 0.006 3.809 0.000 0.011 0.035
V25 -0.8005 0.022 -36.863 0.000 -0.843 -0.758
V29 -0.3273 0.012 -27.333 0.000 -0.351 -0.304
==============================================================================
All the remaining p-values stayed below 0.05 when I dropped V1. Therefore, my final predictor set is X_train6.
LogRO_best2 = LogisticRegression(
C=0.2,
class_weight="balanced",
dual=False,
fit_intercept=True,
l1_ratio=1,
max_iter=100,
multi_class="auto",
n_jobs=None,
random_state=1,
solver='lbfgs',
tol=0.0001,
verbose=0,
warm_start=False)
# Fit the best algorithm to the data.
LogRO_best2.fit(X_train6, y_train_over)
LogisticRegression(C=0.2, class_weight='balanced', l1_ratio=1, random_state=1)
X_val2=X_val.drop(["V1","V4","V6","V7","V8","V9","V10","V11","V13","V14","V16","V17","V18",
"V19","V20","V21","V22","V23","V26","V27",
"V28","V30","V31","V32","V33","V34","V35","V36","V37","V38","V39","V40"],axis=1)
X_val2=X_train6  # Note: this overwrites X_val2 with the training data, which causes the shape mismatch below
print("Accuracy on train and validation set")
print(accuracy_score(y_train_over, LogRO_best2.predict(X_train6)))
print(accuracy_score(y_val, LogRO_best2.predict(X_val2)))
print("Recall on train and validation set")
print(recall_score(y_train_over, LogRO_best2.predict(X_train6)))
print(recall_score(y_val, LogRO_best2.predict(X_val2)))
print("Precision on train and validation set")
print(precision_score(y_train_over, LogRO_best2.predict(X_train6)))
print(precision_score(y_val, LogRO_best2.predict(X_val2)))
print("F1 on train and validation set")
print(f1_score(y_train_over, LogRO_best2.predict(X_train6)))
print(f1_score(y_val, LogRO_best2.predict(X_val2)))
print("")
Note: I keep getting an error message stating that the shapes of X and y don't match. I am going to take the following step just so I can get a final score.
y_val.shape
(6000,)
y_train_over.shape
(26446,)
y_train2=y_train_over.sample(n=6000,random_state=1)
X_train6.shape
(26446, 8)
X_train7=X_train6.sample(n=6000,random_state=1)
X_val2.shape
(26446, 8)
X_val3=X_val2.sample(n=6000,random_state=1)
print("Accuracy on train and validation set")
print(accuracy_score(y_train2, LogRO_best2.predict(X_train7)))
print(accuracy_score(y_val, LogRO_best2.predict(X_val3)))
print("Recall on train and validation set")
print(recall_score(y_train2, LogRO_best2.predict(X_train7)))
print(recall_score(y_val, LogRO_best2.predict(X_val3)))
print("Precision on train and validation set")
print(precision_score(y_train2, LogRO_best2.predict(X_train7)))
print(precision_score(y_val, LogRO_best2.predict(X_val3)))
print("F1 on train and validation set")
print(f1_score(y_train2, LogRO_best2.predict(X_train7)))
print(f1_score(y_val, LogRO_best2.predict(X_val3)))
print("")
Accuracy on train and validation set
0.8703333333333333
0.5005
Recall on train and validation set
0.8752528658125421
0.5105105105105106
Precision on train and validation set
0.8641810918774967
0.05659121171770972
F1 on train and validation set
0.8696817420435511
0.10188792328438717
None of my efforts to improve this model have succeeded. I will go back to my original model, LogRO_best, and employ it.
Note: I am not satisfied with the results. I am going to start over and rerun everything on the simplified dataset, provided a correlation analysis shows that most of the high correlations have been eliminated.
wind_simp=wind.copy()
wind_simp=wind_simp.drop(["V1","V4","V6","V7","V8","V9","V10","V11","V13","V14","V16","V17","V18",
"V19","V20","V21","V22","V23","V26","V27",
"V28","V30","V31","V32","V33","V34","V35","V36","V37","V38","V39","V40"],axis=1)
wind_simp.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 9 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   V2      19982 non-null  float64
 1   V3      20000 non-null  float64
 2   V5      20000 non-null  float64
 3   V12     20000 non-null  float64
 4   V15     20000 non-null  float64
 5   V24     20000 non-null  float64
 6   V25     20000 non-null  float64
 7   V29     20000 non-null  float64
 8   Target  20000 non-null  int64
dtypes: float64(8), int64(1)
memory usage: 1.4 MB
plt.figure(figsize=(8,8))
sns.heatmap(data=wind[["V2","V3","V5","V12","V15","V24","V25",
"V29","Target"]]
.corr(),annot=True,cbar=False,cmap="Spectral")
<Axes: >
All the correlations of 0.70 or higher are gone.
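This can also be checked programmatically rather than by eyeballing the heatmap. A minimal sketch (the helper name is mine) that lists all feature pairs at or above a cutoff:

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.70):
    """Return (feature, feature, corr) tuples with absolute correlation >= threshold."""
    corr = df.corr().abs()
    # keep only the upper triangle to avoid self-pairs and duplicates
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [(a, b, round(upper.loc[a, b], 2))
            for a in upper.index for b in upper.columns
            if pd.notna(upper.loc[a, b]) and upper.loc[a, b] >= threshold]
```

An empty list returned for the simplified frame would confirm the statement above.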
X = wind_simp.drop("Target",axis=1)
y = wind_simp.pop("Target")
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=.30, random_state=1,stratify=y)
# Let's impute the missing values
imp_median = SimpleImputer(missing_values=np.nan, strategy="median")
# fit the imputer on train data and transform the train data
X_train["V2"] = imp_median.fit_transform(X_train[["V2"]])
# transform (not fit) the validation data, so its statistics are not leaked into the imputer
X_val["V2"] = imp_median.transform(X_val[["V2"]])
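As a general pattern, the imputer should learn its statistic from the training split only and then be reused on the validation split. A minimal sketch on toy data:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

X_tr = pd.DataFrame({"V2": [1.0, np.nan, 3.0, 5.0]})
X_va = pd.DataFrame({"V2": [np.nan, 2.0]})

imp = SimpleImputer(missing_values=np.nan, strategy="median")
X_tr["V2"] = imp.fit_transform(X_tr[["V2"]]).ravel()  # learn the median (3.0) from train only
X_va["V2"] = imp.transform(X_va[["V2"]]).ravel()      # reuse the train median on validation
```

Calling fit_transform on the validation split instead would let validation statistics influence preprocessing.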
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("LogR", LogisticRegression(random_state=1)))
models.append(("DTree", DecisionTreeClassifier(random_state=1)))
models.append(("AdaBoost", AdaBoostClassifier(random_state=1)))
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("RF",RandomForestClassifier(random_state=1)))
models.append(("GB", GradientBoostingClassifier(random_state=1)))
results1 = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation performance on training dataset:" "\n")
for name, model in models:
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # Setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train, y=y_train, scoring=scorer, cv=kfold
    )
    results1.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean()))
print("\n" "Validation Performance:" "\n")
for name, model in models:
    model.fit(X_train, y_train)
    scores = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores))
Cross-Validation performance on training dataset:

LogR: 0.4027956989247312
DTree: 0.5817617866004963
AdaBoost: 0.34480562448304386
Bagging: 0.5211662531017369
RF: 0.5276757650951198
GB: 0.5005459057071959

Validation Performance:

LogR: 0.4084084084084084
DTree: 0.6186186186186187
AdaBoost: 0.3333333333333333
Bagging: 0.5855855855855856
RF: 0.5675675675675675
GB: 0.5195195195195195
Although the recall scores are not particularly high, there is definitely less overfitting. Over- or undersampling will reduce the class imbalance and hopefully increase the recall scores.
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(results1)
ax.set_xticklabels(names)
plt.show()
There are still outliers that will skew the data. However, LogR and GB now have medians near the center of the distribution.
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
print("Before OverSampling, counts of label '1': {}".format(sum(y_train == 1)))
print("Before OverSampling, counts of label '0': {} \n".format(sum(y_train == 0)))
# Synthetic Minority Over Sampling Technique
sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
print("After OverSampling, counts of label '1': {}".format(sum(y_train_over == 1)))
print("After OverSampling, counts of label '0': {} \n".format(sum(y_train_over == 0)))
print("After OverSampling, the shape of train_X: {}".format(X_train_over.shape))
print("After OverSampling, the shape of train_y: {} \n".format(y_train_over.shape))
Before OverSampling, counts of label '1': 777
Before OverSampling, counts of label '0': 13223

After OverSampling, counts of label '1': 13223
After OverSampling, counts of label '0': 13223

After OverSampling, the shape of train_X: (26446, 8)
After OverSampling, the shape of train_y: (26446,)
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("LogRO", LogisticRegression(random_state=1)))
models.append(("BaggingO", BaggingClassifier(random_state=1)))
models.append(("DTreeO", DecisionTreeClassifier(random_state=1)))
models.append(("AdaBoostO", AdaBoostClassifier(random_state=1)))
models.append(("RFO", RandomForestClassifier(random_state=1)))
models.append(("GBO", GradientBoostingClassifier(random_state=1)))
results1 = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation performance on training dataset:" "\n")
for name, model in models:
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # Setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train_over, y=y_train_over, scoring=scorer, cv=kfold
    )
    results1.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean()))
print("\n" "Validation Performance:" "\n")
for name, model in models:
    model.fit(X_train_over, y_train_over)
    scores = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores))
Cross-Validation performance on training dataset:

LogRO: 0.8779405666501748
BaggingO: 0.9691452201939548
DTreeO: 0.9540956161398351
AdaBoostO: 0.8670514400761864
RFO: 0.9769347868984667
GBO: 0.9149970400578834

Validation Performance:

LogRO: 0.8348348348348348
BaggingO: 0.7927927927927928
DTreeO: 0.6996996996996997
AdaBoostO: 0.8318318318318318
RFO: 0.8078078078078078
GBO: 0.8648648648648649
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(results1)
ax.set_xticklabels(names)
plt.show()
Nothing very impressive yet: lots of overfitting, although DTreeO's distribution now looks almost symmetrical.
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
print("Before UnderSampling, counts of label '1': {}".format(sum(y_train == 1)))
print("Before UnderSampling, counts of label '0': {} \n".format(sum(y_train == 0)))
print("After UnderSampling, counts of label '1': {}".format(sum(y_train_un == 1)))
print("After UnderSampling, counts of label '0': {} \n".format(sum(y_train_un == 0)))
print("After UnderSampling, the shape of train_X: {}".format(X_train_un.shape))
print("After UnderSampling, the shape of train_y: {} \n".format(y_train_un.shape))
Before UnderSampling, counts of label '1': 777
Before UnderSampling, counts of label '0': 13223

After UnderSampling, counts of label '1': 777
After UnderSampling, counts of label '0': 777

After UnderSampling, the shape of train_X: (1554, 8)
After UnderSampling, the shape of train_y: (1554,)
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("LogRU", LogisticRegression(random_state=1)))
models.append(("BaggingU", BaggingClassifier(random_state=1)))
models.append(("DTreeU", DecisionTreeClassifier(random_state=1)))
models.append(("AdaBoostU", AdaBoostClassifier(random_state=1)))
models.append(("RFU", RandomForestClassifier(random_state=1)))
models.append(("GBU", GradientBoostingClassifier(random_state=1)))
results1 = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation performance on training dataset:" "\n")
for name, model in models:
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # Setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train_un, y=y_train_un, scoring=scorer, cv=kfold
    )
    results1.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean()))
print("\n" "Validation Performance:" "\n")
for name, model in models:
    model.fit(X_train_un, y_train_un)
    scores = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores))
Cross-Validation performance on training dataset:

LogRU: 0.8648800661703888
BaggingU: 0.8507526881720431
DTreeU: 0.8120430107526883
AdaBoostU: 0.8223986765922249
RFU: 0.8893382961124896
GBU: 0.875144747725393

Validation Performance:

LogRU: 0.8468468468468469
BaggingU: 0.8408408408408409
DTreeU: 0.7927927927927928
AdaBoostU: 0.8468468468468469
RFU: 0.8708708708708709
GBU: 0.8738738738738738
Much higher, and much closer, recall scores on the undersampled simplified data. GBU actually fits very closely, with train/validation scores of .875 and .874; those are the closest I have seen so far. BaggingU is a close second at .851/.841, and RFU posts the highest validation recall at .889/.871.
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(results1)
ax.set_xticklabels(names)
plt.show()
comparison_frame1 = pd.DataFrame({'Base Model':['LogR','DTree','AdaBoost','Bagging',
'RF','GB'],
'Train_Recall':[0.40,0.58,0.34,0.52,0.52,0.50],
'Val_Recall':[0.40,0.61,0.33,0.58,0.56,0.51]})
comparison_frame2=pd.DataFrame({'Oversample':['LogRO','BaggingO','DTreeO','AdaBoostO','RFO','GBO'],
                                'Train_Recall':[0.87,0.96,0.95,0.86,0.97,0.91],
                                'Val_Recall':[0.83,0.79,0.70,0.83,0.81,0.86]})
comparison_frame3=pd.DataFrame({'Undersample':['LogRU','BaggingU','DTreeU','AdaBoostU','RFU','GBU'],
                                'Train_Recall':[0.864,0.850,0.812,0.822,0.889,0.873],
                                'Val_Recall':[0.846,0.840,0.792,0.846,0.870,0.875]})
comparison_frame1
|   | Base Model | Train_Recall | Val_Recall |
|---|---|---|---|
| 0 | LogR | 0.400 | 0.400 |
| 1 | DTree | 0.580 | 0.610 |
| 2 | AdaBoost | 0.340 | 0.330 |
| 3 | Bagging | 0.520 | 0.580 |
| 4 | RF | 0.520 | 0.560 |
| 5 | GB | 0.500 | 0.510 |
comparison_frame2
|   | Oversample | Train_Recall | Val_Recall |
|---|---|---|---|
| 0 | LogRO | 0.870 | 0.830 |
| 1 | BaggingO | 0.960 | 0.790 |
| 2 | DTreeO | 0.950 | 0.700 |
| 3 | AdaBoostO | 0.860 | 0.830 |
| 4 | RFO | 0.970 | 0.810 |
| 5 | GBO | 0.910 | 0.860 |
comparison_frame3
|   | Undersample | Train_Recall | Val_Recall |
|---|---|---|---|
| 0 | LogRU | 0.864 | 0.846 |
| 1 | BaggingU | 0.850 | 0.840 |
| 2 | DTreeU | 0.812 | 0.792 |
| 3 | AdaBoostU | 0.822 | 0.846 |
| 4 | RFU | 0.889 | 0.870 |
| 5 | GBU | 0.873 | 0.875 |
Best models are LogRU (.864/.846), RFU (.889/.870), and GBU (.873/.875).
I would really like to treat this dataset for outliers. However, we have been told many times not to eliminate genuine data, especially when it represents continuous values, so I haven't done it.
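If outliers were ever to be treated, one non-destructive first step is simply to count how many values sit beyond a z-score cutoff before deciding anything. A minimal sketch (the helper name is mine):

```python
import numpy as np
import pandas as pd

def outlier_counts(df, z=3.0):
    """Count values lying more than z standard deviations from each column's mean."""
    scores = (df - df.mean()) / df.std(ddof=0)
    return scores.abs().gt(z).sum()
```

This flags rather than drops, so no genuine data is lost while the extent of the problem is measured.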
A. LogRU (Logistic Regression with undersampling)
# defining model
LogRU_tuned = LogisticRegression(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid = {'C':np.arange(0.1,1.1,0.1)}
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=LogRU_tuned, param_distributions=param_grid,
n_iter=10, n_jobs = -1, verbose=2,
scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Best parameters are {'C': 0.1} with CV score=0.8674689826302731:
# Set the clf to the best combination of parameters
LogRU_best = LogisticRegression(
C=0.1,
class_weight="balanced",
dual=False,
fit_intercept=True,
l1_ratio=1,
max_iter=100,
multi_class="auto",
n_jobs=-1,
random_state=1,
solver='lbfgs',
tol=0.0001,
verbose=2,
warm_start=True)
# Fit the best algorithm to the data.
LogRU_best.fit(X_train_over, y_train_over)  # Note: this fits on the oversampled data; X_train_un, y_train_un was likely intended for an undersampled (U) model
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
LogisticRegression(C=0.1, class_weight='balanced', l1_ratio=1, n_jobs=-1,
                   random_state=1, verbose=2, warm_start=True)
print("Accuracy on train and validation set")
print(accuracy_score(y_train_un, LogRU_best.predict(X_train_un)))
print(accuracy_score(y_val, LogRU_best.predict(X_val)))
print("Recall on train and validation set")
print(recall_score(y_train_un, LogRU_best.predict(X_train_un)))
print(recall_score(y_val, LogRU_best.predict(X_val)))
print("Precision on train and validation set")
print(precision_score(y_train_un, LogRU_best.predict(X_train_un)))
print(precision_score(y_val, LogRU_best.predict(X_val)))
print("F1 on train and validation set")
print(f1_score(y_train_un, LogRU_best.predict(X_train_un)))
print(f1_score(y_val, LogRU_best.predict(X_val)))
print("")
Accuracy on train and validation set
0.8552123552123552
0.8475
Recall on train and validation set
0.8481338481338482
0.8348348348348348
Precision on train and validation set
0.860313315926893
0.24428822495606328
F1 on train and validation set
0.8541801685029166
0.37797416723317473
Good accuracy (0.855/0.847) and recall (0.848/0.834): solid scores with little indication of overfitting. Not so good on precision (0.860/0.244), which looks at the balance between TP and FP. Since the precision score is low, it makes sense that the F1 score (0.854/0.377) is also low.
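One way to trade some recall for precision, not explored above, is to move the classification threshold away from the default 0.5 using predict_proba. A minimal sketch on synthetic data (the data and threshold values are my own, for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=400) > 0).astype(int)

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X)[:, 1]  # probability of the positive class

# raising the threshold predicts fewer positives: recall can only fall,
# while precision typically rises
for t in (0.3, 0.5, 0.7):
    pred = (proba >= t).astype(int)
    print(t, round(recall_score(y, pred), 2), round(precision_score(y, pred), 2))
```

For this project, a lower threshold favors catching failures (recall) at the cost of more inspections (precision).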
B. RFU (Random Forest with undersampling)
# defining model
RFU_tuned = RandomForestClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid = {"n_estimators": [100, 150, 200],
              "min_samples_leaf": np.arange(1, 11, 1),
              "max_features": list(np.arange(0.10, 0.80, 0.1)) + ['sqrt'],
              "max_samples": np.arange(0.2, 0.9, 0.10)}
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=RFU_tuned, param_distributions=param_grid,
n_iter=10, n_jobs = -1, verbose=2,
scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Best parameters are {'n_estimators': 200, 'min_samples_leaf': 5, 'max_samples': 0.6000000000000001, 'max_features': 'sqrt'} with CV score=0.894449958643507:
# Set the clf to the best combination of parameters
RFU_best = RandomForestClassifier(
    n_estimators=200,
    min_samples_leaf=5,
    max_samples=0.60,
    max_features='sqrt',
    bootstrap=True,
    random_state=1)
# Fit the best algorithm to the data.
RFU_best.fit(X_train_un, y_train_un)
RandomForestClassifier(max_samples=0.6, min_samples_leaf=5)
print("Accuracy on train and validation set")
print(accuracy_score(y_train_un, RFU_best.predict(X_train_un)))
print(accuracy_score(y_val, RFU_best.predict(X_val)))
print("Recall on train and validation set")
print(recall_score(y_train_un, RFU_best.predict(X_train_un)))
print(recall_score(y_val, RFU_best.predict(X_val)))
print("Precision on train and validation set")
print(precision_score(y_train_un, RFU_best.predict(X_train_un)))
print(precision_score(y_val, RFU_best.predict(X_val)))
print("F1 on train and validation set")
print(f1_score(y_train_un, RFU_best.predict(X_train_un)))
print(f1_score(y_val, RFU_best.predict(X_val)))
print("")
Accuracy on train and validation set
0.9253539253539254
0.853
Recall on train and validation set
0.9214929214929215
0.8678678678678678
Precision on train and validation set
0.9286640726329443
0.25643300798580304
F1 on train and validation set
0.9250645994832041
0.3958904109589042
Accuracy and recall scores are not as close as LogRU's, but alright: accuracy (0.925/0.853) and recall (0.921/0.868) indicate some overfitting. Still low scores with significant overfitting for both precision (0.929/0.256) and F1 (0.925/0.396).
C. GBU (Gradient Boosting with undersampling)
# defining model
GBU_tuned = GradientBoostingClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid={"n_estimators":np.arange(100,150,25),
"learning_rate":[0.2,0.05,1.0],
}
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=GBU_tuned, param_distributions=param_grid,
n_iter=10, n_jobs = -1, verbose=2,
scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Fitting 5 folds for each of 6 candidates, totalling 30 fits
Best parameters are {'n_estimators': 125, 'learning_rate': 0.2} with CV score=0.8880314309346569:
# Set the clf to the best combination of parameters
GBU_best = GradientBoostingClassifier(n_estimators=125,
                                      learning_rate=0.2)
# Fit the best algorithm to the data.
GBU_best.fit(X_train_un, y_train_un)
GradientBoostingClassifier(learning_rate=0.2, n_estimators=125)
print("Accuracy on train and validation set")
print(accuracy_score(y_train_un, GBU_best.predict(X_train_un)))
print(accuracy_score(y_val, GBU_best.predict(X_val)))
print("Recall on train and validation set")
print(recall_score(y_train_un, GBU_best.predict(X_train_un)))
print(recall_score(y_val, GBU_best.predict(X_val)))
print("Precision on train and validation set")
print(precision_score(y_train_un, GBU_best.predict(X_train_un)))
print(precision_score(y_val, GBU_best.predict(X_val)))
print("F1 on train and validation set")
print(f1_score(y_train_un, GBU_best.predict(X_train_un)))
print(f1_score(y_val, GBU_best.predict(X_val)))
print("")
Accuracy on train and validation set
0.9851994851994852
0.8978333333333334
Recall on train and validation set
0.9716859716859717
0.8708708708708709
Precision on train and validation set
0.9986772486772487
0.3372093023255814
F1 on train and validation set
0.984996738421396
0.4861693210393964
The best tuned model is LogRU, with train/validation accuracy of 0.855/0.847, recall of 0.848/0.834, precision of 0.860/0.244, and F1 of 0.854/0.377. As stated earlier, precision compares TP with FP, and F1 balances precision and recall, i.e., the trade-off between FP and FN (Type I and Type II errors).
print (pd.DataFrame(GBU_best.feature_importances_, columns = ["Imp"], index = X_train_un.columns).sort_values(by = 'Imp', ascending = False))
       Imp
V15  0.298
V3   0.211
V5   0.125
V25  0.114
V12  0.099
V29  0.066
V2   0.050
V24  0.038
Source: Easy Visa Project Learner Notebook Full Code
The most important features in this model are V15 (30%), V3 (21%), V5 (13%), and V25 (11%). Together they account for almost 75% of the feature importance.
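The "almost 75%" figure can be checked by cumulatively summing the sorted importances. A minimal sketch using the importance values printed above:

```python
import pandas as pd

imp = pd.Series({"V15": 0.298, "V3": 0.211, "V5": 0.125, "V25": 0.114,
                 "V12": 0.099, "V29": 0.066, "V2": 0.050, "V24": 0.038})
cum = imp.sort_values(ascending=False).cumsum()
print(cum.head(4))  # cumulative share covered by the top four features
```

The cumulative value at V25 (the fourth feature) lands at about 0.75, confirming the statement.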
test=pd.read_csv('/content/drive/MyDrive/Test.csv.csv')
wind_test=wind.copy()  # Note: this copies the training data (wind); wind_test=test.copy() was likely intended for the loaded test set
wind_test=wind_test.drop(["V1","V4","V6","V7","V8","V9","V10","V11","V13","V14","V16","V17","V18",
"V19","V20","V21","V22","V23","V26","V27",
"V28","V30","V31","V32","V33","V34","V35","V36","V37","V38","V39","V40"],axis=1)
wind_test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 9 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   V2      19982 non-null  float64
 1   V3      20000 non-null  float64
 2   V5      20000 non-null  float64
 3   V12     20000 non-null  float64
 4   V15     20000 non-null  float64
 5   V24     20000 non-null  float64
 6   V25     20000 non-null  float64
 7   V29     20000 non-null  float64
 8   Target  20000 non-null  int64
dtypes: float64(8), int64(1)
memory usage: 1.4 MB
wind_test.head()
| V2 | V3 | V5 | V12 | V15 | V24 | V25 | V29 | Target | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -4.679 | 3.102 | -0.221 | 0.736 | -3.376 | 3.133 | 0.665 | -3.982 | 0 |
| 1 | 3.653 | 0.910 | 0.332 | -0.951 | 0.193 | 1.766 | -0.267 | 0.783 | 0 |
| 2 | -5.824 | 0.634 | -1.774 | 1.107 | -3.164 | 1.680 | -0.451 | -2.034 | 0 |
| 3 | 1.888 | 7.046 | 0.083 | 0.460 | -0.454 | -1.818 | 2.124 | -3.963 | 0 |
| 4 | 3.872 | -3.758 | 3.793 | 4.724 | -2.633 | 4.490 | -3.945 | 5.107 | 0 |
wind_test.describe()
| V2 | V3 | V5 | V12 | V15 | V24 | V25 | V29 | Target | |
|---|---|---|---|---|---|---|---|---|---|
| count | 19982.000 | 20000.000 | 20000.000 | 20000.000 | 20000.000 | 20000.000 | 20000.000 | 20000.000 | 20000.000 |
| mean | 0.440 | 2.485 | -0.054 | 1.605 | -2.415 | 1.134 | -0.002 | -0.986 | 0.056 |
| std | 3.151 | 3.389 | 2.105 | 2.930 | 3.355 | 3.912 | 2.017 | 2.684 | 0.229 |
| min | -12.320 | -10.708 | -8.603 | -12.948 | -16.417 | -16.387 | -8.228 | -12.579 | 0.000 |
| 25% | -1.641 | 0.207 | -1.536 | -0.397 | -4.415 | -1.468 | -1.365 | -2.787 | 0.000 |
| 50% | 0.472 | 2.256 | -0.102 | 1.508 | -2.383 | 0.969 | 0.025 | -1.176 | 0.000 |
| 75% | 2.544 | 4.566 | 1.340 | 3.571 | -0.359 | 3.546 | 1.397 | 0.630 | 0.000 |
| max | 13.089 | 17.091 | 8.134 | 15.081 | 12.246 | 17.163 | 8.223 | 10.722 | 1.000 |
Observation: V2 still contains missing values, so they must be imputed.
# pipeline takes a list of (name, step) tuples. The last entry is the call to the modeling algorithm
# Note: a resampling step like RandomUnderSampler needs imblearn.pipeline.Pipeline;
# scikit-learn's own Pipeline only accepts transformers plus a final estimator.
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('under_sample', RandomUnderSampler(random_state=1, sampling_strategy=1)),
    ('gr', GradientBoostingClassifier(learning_rate=0.2, n_estimators=125))
])
Source: Hands-on Notebook Pipeline and Make Pipeline
Note: This is the first pipeline I built. It includes all steps, but gave me an error message.
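For reference, the error likely comes from two things: the under_sample step is not wrapped in a tuple, and scikit-learn's Pipeline only chains transformers plus a final estimator, so a resampler is rejected (imblearn.pipeline.Pipeline accepts one). A minimal scikit-learn-only sketch on toy data, with resampling kept outside the pipeline:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0.8).astype(int)  # imbalanced toy labels

# resampling changes the number of rows, which sklearn pipeline steps may not do,
# so it would be applied to (X, y) before fitting; the pipeline holds the rest
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("gb", GradientBoostingClassifier(learning_rate=0.2, n_estimators=125,
                                      random_state=1)),
])
pipe.fit(X, y)
print(pipe.score(X, y))
```

With imbalanced-learn installed, the sampler could instead be a named step inside imblearn.pipeline.Pipeline, which is the cleaner option.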
pipeline2 = Pipeline([
('scaler',StandardScaler()),
('bestgb', GradientBoostingClassifier(learning_rate=0.2,n_estimators=125))
])
Source: Hands-on Notebook Pipeline and Make Pipeline
X = wind_test.drop("Target",axis=1)
y = wind_test.pop("Target")
# Let's impute the missing values
imp_median = SimpleImputer(missing_values=np.nan, strategy="median")
# fit the imputer on train data and transform the train data
X["V2"] = imp_median.fit_transform(X[["V2"]])
6. Fit the pipeline.
pipeline2.fit(X,y)
Pipeline(steps=[('scaler', StandardScaler()),
                ('bestgb',
                 GradientBoostingClassifier(learning_rate=0.2,
                                            n_estimators=125))])
Source: Hands_on Notebook Pipeline and Make Pipeline
pipeline2.score(X,y)
0.9867
pipeline2.score(X_train_un,y_train_un)
0.888030888030888
Source: Hands-on Notebook Pipeline and Make Pipeline
Note: I ran the score a second time because the first did not include the undersampling step, which was important in fitting the model.
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
X_test_un, y_test_un = rus.fit_resample(X, y)
print("Before UnderSampling, counts of label '1': {}".format(sum(y == 1)))
print("Before UnderSampling, counts of label '0': {} \n".format(sum(y == 0)))
print("After UnderSampling, counts of label '1': {}".format(sum(y_test_un == 1)))
print("After UnderSampling, counts of label '0': {} \n".format(sum(y_test_un == 0)))
print("After UnderSampling, the shape of train_X: {}".format(X_test_un.shape))
print("After UnderSampling, the shape of train_y: {} \n".format(y_test_un.shape))
Before UnderSampling, counts of label '1': 1110
Before UnderSampling, counts of label '0': 18890

After UnderSampling, counts of label '1': 1110
After UnderSampling, counts of label '0': 1110

After UnderSampling, the shape of train_X: (2220, 8)
After UnderSampling, the shape of train_y: (2220,)
GBU_best_test = GradientBoostingClassifier ( n_estimators=125,
learning_rate=0.2,
)
# Fit the best algorithm to the data.
GBU_best_test.fit(X_test_un, y_test_un)  # Note: this model is both fit and scored on the same undersampled split, so the scores below are in-sample
GradientBoostingClassifier(learning_rate=0.2, n_estimators=125)
print("Accuracy on train and test set")
print(accuracy_score(y_train_un, GBU_best.predict(X_train_un)))
print(accuracy_score(y_test_un, GBU_best_test.predict(X_test_un)))
print("Recall on train and test set")
print(recall_score(y_train_un, GBU_best.predict(X_train_un)))
print(recall_score(y_test_un, GBU_best_test.predict(X_test_un)))
print("Precision on train and test set")
print(precision_score(y_train_un, GBU_best.predict(X_train_un)))
print(precision_score(y_test_un, GBU_best_test.predict(X_test_un)))
print("F1 on train and test set")
print(f1_score(y_train_un, GBU_best.predict(X_train_un)))
print(f1_score(y_test_un, GBU_best_test.predict(X_test_un)))
print("")
Accuracy on train and test set
0.9851994851994852
0.9707207207207207
Recall on train and test set
0.9716859716859717
0.9585585585585585
Precision on train and test set
0.9986772486772487
0.9824561403508771
F1 on train and test set
0.984996738421396
0.9703602371181029
## Function to create confusion matrix
def make_confusion_matrix(model,y_actual,labels=[1, 0]):
'''
model : classifier to predict values of X
y_actual : ground truth
'''
y_predict = model.predict(X_test_un)
cm=metrics.confusion_matrix( y_actual, y_predict, labels=[0, 1])
df_cm = pd.DataFrame(cm, index = [i for i in ["Actual - No","Actual - Yes"]],
columns = [i for i in ['Predicted - No','Predicted - Yes']])
group_counts = ["{0:0.0f}".format(value) for value in
cm.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in
cm.flatten()/np.sum(cm)]
labels = [f"{v1}\n{v2}" for v1, v2 in
zip(group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)
plt.figure(figsize = (5,5))
sns.heatmap(df_cm, annot=labels,cbar=False,cmap="Spectral",fmt='')
plt.ylabel('True label')
plt.xlabel('Predicted label')
make_confusion_matrix(GBU_best_test,y_test_un)
Source: Project_SLC_InnHotels_Project_FullCode
feature_names = list(X_train_un.columns)
importances = GBU_best_test.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="green", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Source: Project_SCL_DBSA_InnHotels_FullCode
print (pd.DataFrame(GBU_best_test.feature_importances_, columns = ["Imp"], index = X_test_un.columns).sort_values(by = 'Imp', ascending = False))
       Imp
V15  0.316
V3   0.197
V5   0.126
V12  0.112
V25  0.078
V29  0.065
V2   0.053
V24  0.052
Observations: Still slight overfitting on the training vs. test data. However, the results are much better than expected.
Conclusions: This is an assignment for a course and not a genuine data analysis. I could continue with feature engineering and drop non-continuous outliers lying more than 3 standard deviations from the median. I did not do that because, if I continued, I would run out of time and be unable to complete the assignment by the deadline. I also checked the low-code version of this notebook, and this was not one of the steps. Dropping non-continuous outliers would eliminate data points, but I believe it would also make the model more generalizable.
Please see presentation. Thank you.